Hi,

While running some benchmarks comparing 17 and 18, I ran into a simple workload where throughput on 18 drops by ~80%. After pulling my hair out for a couple of hours I realized the change that triggered this is 04bec894a04c, which enabled data checksums by default. That's very bizarre, because the workload is read-only and fits into shared buffers.
I've only observed this on large machines with 96+ cores, on Azure, both with Intel (Xeon 8370C) and AMD (EPYC 9V33X). I've not been successful in reproducing it on the puny machines I have at home.

Let me demonstrate the issue.

1) Create a cluster, with an increased connection limit:

  pg_ctl -D data init
  echo 'max_connections = 1000' >> data/postgresql.conf
  pg_ctl -D data -l pg.log start

2) Now the benchmark itself - it's fairly trivial, regular pgbench on scale 1, with an extra index on the "pgbench_accounts.bid" column:

  createdb test
  pgbench -i -s 1 test
  psql test -c "create index on pgbench_accounts (bid)"

and a script (select.sql) with a simple query using the index:

  select count(*) from pgbench_accounts where bid = 0;

3) Cool, now let's get some numbers for 32-160 clients:

  for c in 32 64 96 128 160; do
    pgbench -n -f select.sql -T 10 -M prepared -c $c -j $c test | grep 'tps';
  done;

Which produces this:

  tps = 752464.727387 (without initial connection time)
  tps = 1062150.015297 (without initial connection time)
  tps = 572205.386159 (without initial connection time)
  tps = 568579.663980 (without initial connection time)
  tps = 561360.493639 (without initial connection time)

Clearly, at 96 clients the throughput just tanks.

4) Now let's disable checksums on the cluster:

  pg_ctl -D data -l pg.log stop
  pg_checksums --disable data
  pg_ctl -D data -l pg.log start

and run the benchmark again:

  tps = 753484.468817 (without initial connection time)
  tps = 1083752.631856 (without initial connection time)
  tps = 1862008.466802 (without initial connection time)
  tps = 1826484.489433 (without initial connection time)
  tps = 1818400.279069 (without initial connection time)

This time the throughput does not drop, and at 96+ clients it's ~3.5x higher.

This is from the Xeon machine, but I've seen the same thing on the EPYC boxes. The impact varies, but in general it's 70-80%.

I'm not suggesting the regression is caused by 04bec894a04c itself, or even that it's specific to PG 18 - I see the same issue on 17, except that 17 does not enable checksums by default. For example, on an EPYC 9V74, 17 without checksums does this:

   32     762187.724859
   64    1284731.279766
   96    2978828.264373
  128    2991561.835178
  160    2971689.161136

and with checksums:

   32     874266.549870
   64    1286382.426281
   96     569647.384735
  128     562128.010301
  160     561826.908181

So, a similar regression ...

I find this quite bizarre / puzzling, because this is a read-only workload, with absolutely no writes, and a tiny data set (~15MB), i.e. everything fits into shared buffers. Why would that be affected by checksums at all?

I spent some time profiling this, without much success. This is what I get from perf top:

  Samples: 6M of event 'cycles:P', 4000 Hz, Event count (approx.): 5270683795302 lost: 0/0 drop: 0/0
  Overhead  Shared Object  Symbol
    50.94%  postgres       [.] pg_atomic_read_u32_impl
    17.32%  postgres       [.] pg_atomic_compare_exchange_u32_impl
    10.17%  postgres       [.] spin_delay
     5.83%  postgres       [.] pg_atomic_fetch_or_u32_impl
     1.64%  postgres       [.] pg_atomic_compare_exchange_u32_impl
     1.20%  postgres       [.] BufferDescriptorGetBuffer
     0.92%  postgres       [.] perform_spin_delay

and the report with backtraces says most of the time is spent here:

  --97.00%--btgettuple
            |
             --96.98%--_bt_first
                       |
                       |--48.82%--_bt_readfirstpage
                       |          |
                       |          |--44.57%--_bt_steppage
                       |          |          |
                       |          |           --44.45%--ReleaseBuffer
                       |          |                     |
                       |          |                      --44.43%--UnpinBuffer
                       |          |                                |
                       |          |                                ...
                       |          |...
                       |
                        --48.11%--_bt_search
                                  |
                                   --47.89%--_bt_relandgetbuf

The atomic ops come from pinning/unpinning buffers.
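For anyone who doesn't want to dig into bufmgr.c, here's a minimal standalone sketch of the pattern I mean - C11 atomics and pthreads, with all names made up, not the actual PostgreSQL code. Each "backend" pins the buffer by updating a single shared 32-bit state word in a CAS retry loop and unpins it with an atomic decrement, and the retry counter shows how much the threads step on each other:

/*
 * sketch.c - hypothetical illustration of the buffer pin/unpin contention
 * pattern: many threads updating one shared 32-bit state word.
 *
 * This is NOT PostgreSQL code, just a self-contained analogy.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS    8           /* bump this to see retries grow */
#define NPINS       1000000     /* pin/unpin cycles per thread */

static _Atomic uint32_t buf_state = 0;      /* stand-in for the packed state word */
static _Atomic uint64_t cas_retries = 0;    /* how often the CAS had to retry */

static void *
worker(void *arg)
{
    (void) arg;

    for (int i = 0; i < NPINS; i++)
    {
        uint32_t    old = atomic_load(&buf_state);

        /* "pin": increment the shared word in a CAS retry loop */
        while (!atomic_compare_exchange_weak(&buf_state, &old, old + 1))
            atomic_fetch_add(&cas_retries, 1);

        /* "unpin": atomic decrement of the same word */
        atomic_fetch_sub(&buf_state, 1);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   threads[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("%d threads, %llu CAS retries\n", NTHREADS,
           (unsigned long long) atomic_load(&cas_retries));
    return 0;
}

Compile with something like "cc -O2 -pthread sketch.c" - the number of retries (and the time spent on them) climbs quickly as NTHREADS grows, which is the kind of behavior the spin_delay / compare_exchange samples above hint at.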
I realize it's possible this gets much more expensive under concurrency (the clients simply have to compete when updating the same counter, and with enough clients there will be more conflicts and retries). That's kinda unfortunate, and maybe we should do something about it, I'm not sure. But why would it depend on checksums at all? This read-only test should be entirely in-memory, so how come it's affected?

regards

--
Tomas Vondra