On 2017-02-25 00:40, Petr Jelinek wrote:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

Here are some results. There is improvement although it's not an unqualified success.

Several repeated runs of pgbench_derail2.sh, each with a different number of clients, yielded one output file each.

Those show that logrep is now pretty stable when there is only 1 client (pgbench -c 1). But it starts making mistakes with 4, 8, 16 clients. I'll just show a grep of the output files; I think it is self-explanatory:

Output-files (lines counted with  grep | sort | uniq -c):

-- out_20170225_0129.txt
    250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
    250 -- All is well.

-- out_20170225_0654.txt
     25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
     24 -- All is well.
      1 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0711.txt
     25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
     23 -- All is well.
      2 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0803.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     11 -- All is well.
     14 -- Not good, but breaking out of wait (waited more than 60s)

So, that says:
1 client: 250x success, zero fail (250 not a typo, ran this overnight)
4 clients: 24x success, 1 fail
8 clients: 23x success, 2 fail
16 clients: 11x success, 14 fail
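
For anyone who wants to turn such output files into failure percentages, here is a minimal sketch. It is not part of pgbench_derail2.sh; the function name summarize is made up for illustration, and it assumes each run writes exactly one status line containing either "All is well" or "Not good", as in the grep output above.

```shell
# Sketch only: summarize one pgbench_derail2.sh output file.
# Assumes one status line per run, containing either
# "All is well" or "Not good" (an assumption based on the grep output).
summarize() {
    ok=$(grep -c 'All is well' "$1")    # successful runs
    bad=$(grep -c 'Not good' "$1")      # runs that gave up waiting
    total=$((ok + bad))
    if [ "$total" -gt 0 ]; then
        # Integer arithmetic, so the percentage is rounded down.
        echo "$1: $ok ok, $bad failed ($((100 * bad / total))% failure)"
    else
        echo "$1: no status lines found"
    fi
}
```

With the 16-client counts (11 success, 14 fail) this arithmetic gives a 56% failure rate; for 4 and 8 clients it gives 4% and 8%.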

I want to repeat what I said a few emails back: problems seem to disappear when a short wait state is introduced (directly after the 'alter subscription sub1 enable' line) to give the logrep machinery time to 'settle'. It makes one think of a timing error somewhere (now don't ask me where..).
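
Roughly, the wait amounts to something like the sketch below. This is not the actual code from pgbench_derail2.sh: the function name enable_and_settle and the PSQL override are made up for illustration, and connection options are left out. Only the variable name INIT_WAIT and the 'alter subscription sub1 enable' statement come from the script.

```shell
# Sketch only, not the real pgbench_derail2.sh.
# PSQL is a hypothetical override so the client command can be swapped out;
# connection options (port, database) are omitted on purpose.
INIT_WAIT=${INIT_WAIT:-10}   # seconds; 10 is the value used for the 16-client run in this report

enable_and_settle() {
    # Enable the subscription, then pause before pgbench starts
    # so that the logrep machinery has time to settle.
    ${PSQL:-psql} -qX -c "alter subscription sub1 enable" || return 1
    sleep "$INIT_WAIT"
}
```

Setting INIT_WAIT=0 gives the no-wait behaviour, which makes it easy to compare both variants with otherwise identical runs.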

To show that, here is pgbench_derail2.sh output from a run that waited 10 seconds (INIT_WAIT in the script) as such a 'settle' period; it works faultlessly (with 16 clients):

-- out_20170225_0852.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     25 -- All is well.

QED.

(By the way, no hung sessions so far, so that's good)


thanks

Erik Rijkers

