Hello Andres,

> On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
>>    sort/flush : tps avg & stddev (percent of time beyond 10.0 tps)
>>     on   on   : 631 +- 131 (0.1%)
>>     on   off  : 564 +- 303 (12.0%)
>>     off  on   : 167 +- 315 (76.8%) # stuck...
>>     off  off  : 177 +- 305 (71.2%) # ~ current pg
>
> What exactly do you mean with 'stuck'?

I mean that during the I/O storms induced by the checkpoint, pgbench sometimes gets stuck, i.e. it does not report its progress every second (I run with "-P 1"). This occurs when sort is off, either with or without flush. For instance, here is an extract from the off/off medium run:

 progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
 progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
 progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
 progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan

There is a 239.4-second gap in the pgbench output (from 575.0 s to 814.4 s). This occurs from time to time and may represent a significant part of the run, and I count these "stuck" times as 0 tps. Sometimes pgbench is stuck performance-wise but nevertheless manages to report "0.0 tps" every second, as in the extract above after it becomes unstuck.
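To make the accounting concrete, here is a minimal sketch of how the "-P 1" output could be post-processed so that missing reports count as 0 tps. This is not the script used for the figures in this mail: the function names, the one-sample-per-second convention, and the use of a population standard deviation are all assumptions made for illustration.

```python
import re
import statistics

# Matches pgbench "-P" progress lines such as:
#   progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
PROGRESS_RE = re.compile(r"progress:\s+([\d.]+)\s+s,\s+([\d.]+)\s+tps")


def parse_progress(lines):
    """Return a list of (time_s, tps) pairs from pgbench progress lines."""
    reports = []
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            reports.append((float(m.group(1)), float(m.group(2))))
    return reports


def largest_gap(reports):
    """Longest time between two consecutive progress reports, in seconds."""
    return max((b[0] - a[0] for a, b in zip(reports, reports[1:])), default=0.0)


def fill_stuck_seconds(reports, interval=1.0):
    """Return per-interval tps samples, adding 0.0 for every missing report.

    Assumption: one report is expected per `interval` seconds, so a gap of
    N intervals between two reports means about N - 1 reports were skipped.
    """
    samples = []
    prev_t = None
    for t, tps in reports:
        if prev_t is not None:
            missing = max(0, round((t - prev_t) / interval) - 1)
            samples.extend([0.0] * missing)
        samples.append(tps)
        prev_t = t
    return samples


def summarize(samples, threshold=10.0):
    """Return (average tps, stddev, percent of samples below threshold)."""
    avg = statistics.mean(samples)
    stddev = statistics.pstdev(samples)  # population stddev: an assumption
    below = 100.0 * sum(1 for s in samples if s < threshold) / len(samples)
    return avg, stddev, below


if __name__ == "__main__":
    extract = [
        "progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977",
        "progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059",
        "progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708",
        "progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan",
        "progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan",
    ]
    reports = parse_progress(extract)
    print("largest gap (s):", round(largest_gap(reports), 1))
    avg, stddev, below = summarize(fill_stuck_seconds(reports))
    print(f"{avg:.0f} +- {stddev:.0f} ({below:.1f}% below 10.0 tps)")
```

Counting the missing seconds as zeros is what makes the "percent of time below 10.0 tps" column reflect the stuck periods, instead of silently dropping them from the average.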

The actual origin of the stuck client (pgbench, libpq, OS, postgres...) is unclear to me, but the whole system does not behave well under an I/O storm anyway, and I have not succeeded in understanding where pgbench is stuck when it does not report its progress. I tried some runs under gdb, but then pgbench did not get stuck and reported a lot of "0.0 tps" during the storms.


Here are a few more figures with the v8 version of the patch, on a host with 8 cores, 16 GB, RAID 1 HDD, under Ubuntu precise. I already reported the medium case; the small case was run afterwards.

  small postgresql.conf:
    shared_buffers = 2GB
    checkpoint_timeout = 300s # this is the default
    checkpoint_completion_target = 0.8
    # initialization: pgbench -i -s 120

  medium postgresql.conf: ## ALREADY REPORTED
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB
    # initialization: pgbench -i -s 250

  warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    3.5    2.7
     on   off  :   24.6   16.2
     off  on   :   66.1   68.4
     off  off  :   63.2   68.7

  # 200 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    1.9    2.7
     on   off  :   14.3    9.5
     off  on   :   45.6   47.4
     off  off  :   47.4   48.8

  # 100 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    0.9    1.8
     on   off  :    9.3    7.9
     off  on   :    5.0   13.0
     off  off  :   31.2   31.9

  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

      options  / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /    small              medium
     on   on   : 564 +- 148 ( 0.1%)   631 +- 131 ( 0.1%)
     on   off  : 470 +- 340 (21.7%)   564 +- 303 (12.0%)
     off  on   : 157 +- 296 (66.2%)   167 +- 315 (76.8%)
     off  off  : 154 +- 251 (61.5%)   177 +- 305 (71.2%)

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

      options  / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /    small              medium
     on   on   : 757 +- 417 ( 0.1%)  1058 +- 455 ( 0.1%)
     on   off  : 752 +- 893 (48.4%)  1056 +- 942 (32.8%)
     off  on   : 173 +- 521 (83.0%)   170 +- 500 (88.3%)
     off  off  : 199 +- 512 (82.5%)   209 +- 506 (82.0%)

In all cases, "sort on & flush on" provides the best results, with a tps speedup of 3 to 5 times over the current behavior (sort off & flush off), and overall high responsiveness (& lower latency).
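As a quick check of that range, the ratios can be recomputed from the full-speed tables above. The snippet below is only an illustrative sketch: the dictionary and function names are made up for the example, and it takes "sort off & flush off" as the baseline.

```python
# (sort on / flush on, sort off / flush off) average tps,
# copied from the two full-speed tables above.
FULL_SPEED_TPS = {
    "1 client, small": (564, 154),
    "1 client, medium": (631, 177),
    "4 clients, small": (757, 199),
    "4 clients, medium": (1058, 209),
}


def speedups(table):
    """Return the on/on over off/off tps ratio for each test case."""
    return {case: on_on / off_off for case, (on_on, off_off) in table.items()}


if __name__ == "__main__":
    for case, ratio in speedups(FULL_SPEED_TPS).items():
        print(f"{case}: {ratio:.1f}x")
```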

--
Fabien.

