On Thu, Jun 8, 2023 at 6:18 PM Andres Freund <and...@anarazel.de> wrote:
> Could you get a profile with call graphs? We need to know what leads to all
> those osq_lock calls.
> perf record --call-graph dwarf -a sleep 1
> or such should do the trick, if run while the workload is running.

I'm doing something wrong because I can't find the slow part in the perf
data; I'll get back to you on this one. (The report invocation I plan to
retry is sketched at the bottom of this mail.)

> I think it's unwise to compare builds of such different vintage. The compiler
> options and compiler version can have substantial effects.
>
> I recommend also using -P1. Particularly when using unix sockets, the
> specifics of how client threads and server threads are scheduled plays a huge
> role.

Fair suggestions. Those graphs come out of pgbench-tools, where I profile
all the latency; fast results for me are ruler flat. It's taken me several
generations of water cooling experiments to reach that point, but even that
only buys me 10 seconds before I can overload a CPU into higher latency with
tougher workloads.

Here are a few seconds of slightly updated examples, now with matching
PGDG-sourced 14+15 on the 5950X, and with sched_autogroup_enabled=0 too:

$ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5434 -P 1 pgbench
pgbench (14.8 (Ubuntu 14.8-1.pgdg23.04+1))
progress: 1.0 s, 1032929.3 tps, lat 0.031 ms stddev 0.004
progress: 2.0 s, 1051239.0 tps, lat 0.030 ms stddev 0.001
progress: 3.0 s, 1047528.9 tps, lat 0.030 ms stddev 0.008
...

$ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5432 -P 1 pgbench
pgbench (15.3 (Ubuntu 15.3-1.pgdg23.04+1))
progress: 1.0 s, 171816.4 tps, lat 0.184 ms stddev 0.029, 0 failed
progress: 2.0 s, 173501.0 tps, lat 0.184 ms stddev 0.024, 0 failed
...

On the slow runs it will even do this; watch my 5950X accomplish 0 TPS for
a second!

progress: 38.0 s, 177376.9 tps, lat 0.180 ms stddev 0.039, 0 failed
progress: 39.0 s, 35861.5 tps, lat 0.181 ms stddev 0.032, 0 failed
progress: 40.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 41.0 s, 222.1 tps, lat 304.500 ms stddev 741.413, 0 failed
progress: 42.0 s, 101199.6 tps, lat 0.530 ms stddev 18.862, 0 failed
progress: 43.0 s, 98286.9 tps, lat 0.328 ms stddev 8.156, 0 failed

Gonna have to measure seconds/transaction if this gets any worse.

> I've seen such issues in the past, primarily due to contention internal to
> cgroups, when the memory controller is enabled. IIRC that could be alleviated
> to a substantial degree with cgroup.memory=nokmem.

I cannot express on-list how much I dislike everything about the cgroups
code. Let me dig up the right call graph data first; I'll know more then.
The thing that keeps me from chasing kernel tuning too hard is seeing the
PG14 build go perfectly every time. This is a really weird one.

All the suggestions much appreciated.
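
For reference, here's roughly what I intend to run the next time I catch a
slow interval. The report options (caller-first graph, 0.5% threshold) are
just my first guess at what will surface the paths into osq_lock, not
something I've validated yet:

$ perf record --call-graph dwarf -a sleep 1   # as root, or with kernel.perf_event_paranoid relaxed
$ perf report --stdio --no-children -g graph,0.5,caller > report.txt
$ grep -n osq_lock report.txt | head          # find where the lock spin shows up before reading the surrounding chains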
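
And on the cgroup.memory=nokmem idea: if the call graphs do end up pointing
at cgroup accounting, my understanding is that's a kernel boot parameter
rather than a sysctl, so on a stock Ubuntu setup it would look something like
this (usual GRUB file locations assumed; I haven't tried it on this box yet):

$ sudoedit /etc/default/grub   # append cgroup.memory=nokmem to GRUB_CMDLINE_LINUX_DEFAULT
$ sudo update-grub
$ sudo reboot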