Hi,

While running some benchmarks comparing PG 17 and 18, I ran into a
simple workload where throughput on 18 drops by ~80%. After pulling my
hair out for a couple of hours I realized the change that triggered
this is 04bec894a04c, which enabled data checksums by default. That is
very bizarre, because the workload is read-only and fits into shared
buffers.

I've only observed this on large machines with 96+ cores, on Azure,
both with Intel (Xeon 8370C) and AMD (EPYC 9V33X). I've not been able
to reproduce it on the puny machines I have at home.

Let me demonstrate the issue.

1) Create a cluster, with an increased connection limit:

  pg_ctl -D data init
  echo 'max_connections = 1000' >> data/postgresql.conf
  pg_ctl -D data -l pg.log start

2) Now the benchmark itself - it's fairly trivial: regular pgbench on
scale 1, with an extra index on the "pgbench_accounts.bid" column:

  createdb test
  pgbench -i -s 1 test
  psql test -c "create index on pgbench_accounts (bid)"

and a script with a simple query using the index:

  select count(*) from pgbench_accounts where bid = 0
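
This goes into select.sql, the script file used by the pgbench loop
below, e.g.:

  echo "select count(*) from pgbench_accounts where bid = 0;" > select.sql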

Cool, now let's get some numbers for 32-160 clients:

  for c in 32 64 96 128 160; do
    pgbench -n -f select.sql -T 10 -M prepared -c $c -j $c test | grep 'tps'
  done


Which produces this:

  tps = 752464.727387 (without initial connection time)
  tps = 1062150.015297 (without initial connection time)
  tps = 572205.386159 (without initial connection time)
  tps = 568579.663980 (without initial connection time)
  tps = 561360.493639 (without initial connection time)

Clearly, at 96 clients the throughput just tanks. Now let's disable
checksums on the cluster:

  pg_ctl -D data -l pg.log stop
  pg_checksums --disable data
  pg_ctl -D data -l pg.log start
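
Side note: an easy way to double-check which mode the cluster is in is
the read-only data_checksums GUC, which should now report 'off':

  psql test -c "show data_checksums"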


and run the script again:

  tps = 753484.468817 (without initial connection time)
  tps = 1083752.631856 (without initial connection time)
  tps = 1862008.466802 (without initial connection time)
  tps = 1826484.489433 (without initial connection time)
  tps = 1818400.279069 (without initial connection time)


Clearly, the throughput no longer tanks at 96+ clients, and it's ~3.5x
higher than with checksums enabled. This is from the Xeon machine, but
I've seen the same thing on the EPYC boxes. The impact varies, but in
general the drop is 70-80%.

To be clear, I'm not suggesting 04bec894a04c is at fault, or that this
is specific to PG 18. I see the same issue on 17, except that 17 does
not enable checksums by default. For example, on EPYC 9V74, PG 17
without checksums does this (clients / tps):

   32   762187.724859
   64  1284731.279766
   96  2978828.264373
  128  2991561.835178
  160  2971689.161136

and with checksums enabled:

   32   874266.549870
   64  1286382.426281
   96   569647.384735
  128   562128.010301
  160   561826.908181

So, a similar regression ...


I find this quite bizarre / puzzling, because this is a read-only
workload, with absolutely no writes, and a tiny data set (~15MB), i.e.
everything fits into shared buffers. Why would that be affected by
checksums at all?
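
For the record, it's easy to confirm the data set really is tiny
compared to shared_buffers, e.g.:

  psql test -c "select pg_size_pretty(pg_database_size('test'))"
  psql test -c "show shared_buffers"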

I spent some time profiling this, without much success. This is what I
get from perf top:

 Samples: 6M of event 'cycles:P', 4000 Hz, Event count (approx.): 5270683795302 lost: 0/0 drop: 0/0
 Overhead  Shared Object         Symbol
   50.94%  postgres              [.] pg_atomic_read_u32_impl
   17.32%  postgres              [.] pg_atomic_compare_exchange_u32_impl
   10.17%  postgres              [.] spin_delay
    5.83%  postgres              [.] pg_atomic_fetch_or_u32_impl
    1.64%  postgres              [.] pg_atomic_compare_exchange_u32_impl
    1.20%  postgres              [.] BufferDescriptorGetBuffer
    0.92%  postgres              [.] perform_spin_delay

and the report with backtraces says most of the time is spent here:

  --97.00%--btgettuple
        |
         --96.98%--_bt_first
              |
              |--48.82%--_bt_readfirstpage
              |     |
              |     |--44.57%--_bt_steppage
              |     |     |
              |     |      --44.45%--ReleaseBuffer
              |     |            |
              |     |             --44.43%--UnpinBuffer
              |     |                       |
              |     |                        ...
              |     ...
              |
               --48.11%--_bt_search
                    |
                     --47.89%--_bt_relandgetbuf
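
Both profiles are just plain perf, collected roughly like this (give or
take the exact flags):

  perf top -g
  # and for the report with backtraces:
  perf record -g -a -- sleep 10
  perf report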



The atomic ops come from pinning/unpinning buffers. I realize this can
get much more expensive under concurrency (the clients have to compete
when updating the same counter, and with enough clients there will be
more conflicts and retries). Kinda unfortunate, and maybe we should do
something about it, not sure.

But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?


regards

-- 
Tomas Vondra


