Hi,

On 2023-06-08 15:08:57 -0400, Gregory Smith wrote:
> Pushing SELECT statements at socket speeds with prepared statements is a
> synthetic benchmark that normally demos big pgbench numbers. My benchmark
> farm moved to Ubuntu 23.04/kernel 6.2.0-20 last month, and that test is
> badly broken on the system PG15 at larger core counts, with as much as an
> 85% drop from expectations. Since this is really just a benchmark workload
> the user impact is very narrow, probably zero really, but as the severity
> of the problem is high we should get to the bottom of what's going on.
> First round of profile data suggests the lost throughput is going here:
>
>   Overhead  Shared Object  Symbol
>    74.34%   [kernel]       [k] osq_lock
>     2.26%   [kernel]       [k] mutex_spin_on_owner

Could you get a profile with call graphs? We need to know what leads to all
those osq_lock calls.

  perf record --call-graph dwarf -a sleep 1

or such should do the trick, if run while the workload is running.

> Quick test to find if you're impacted: on the server and using sockets,
> run a 10 second SELECT test with/without preparation using 1 or 2
> clients/[core|thread] and see if preparation is the slower result. Here's
> a PGDG PG14 on port 5434 as a baseline, next to Ubuntu 23.04's regular
> PG15, all using the PG15 pgbench on AMD 5950X:

I think it's unwise to compare builds of such different vintage. The
compiler options and compiler version can have substantial effects.

> $ pgbench -i -s 100 pgbench -p 5434
> $ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5434 pgbench
> pgbench (14.8 (Ubuntu 14.8-1.pgdg23.04+1))
> tps = 1058195.197298 (without initial connection time)

I recommend also using -P1. Particularly when using unix sockets, the
specifics of how client threads and server threads are scheduled play a
huge role. How large a role can change significantly between runs and
between fairly minor changes to how things are executed (e.g. between
major PG versions).

E.g. on my workstation (two sockets, 10 cores/20 threads each), with 32
clients, performance changes back and forth between ~600k and ~850k TPS,
whereas with 42 clients it's steadily at 1.1M TPS, with little variance.

I have also seen very odd behaviour on larger machines when
/proc/sys/kernel/sched_autogroup_enabled is set to 1.

> There's been plenty of recent chatter on LKML about *osq_lock*, in January
> Intel reported a 20% benchmark regression on UnixBench that might be
> related. Work is still ongoing this week:

I've seen such issues in the past, primarily due to contention internal to
cgroups, when the memory controller is enabled. IIRC that could be
alleviated to a substantial degree with cgroup.memory=nokmem.

Greetings,

Andres Freund
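PS: In case it helps to make the above concrete, here is a rough sketch of
what I'd run; scale, client counts and the omitted -p port are just example
values taken from the commands quoted above, adjust them to your setup:

  # prepared vs. simple query protocol, with per-second progress (-P 1)
  $ pgbench -i -s 100 pgbench
  $ pgbench -S -T 10 -P 1 -c 32 -j 32 -M prepared pgbench
  $ pgbench -S -T 10 -P 1 -c 32 -j 32 -M simple pgbench

  # while the prepared-statement run is active, capture a call-graph profile
  $ perf record --call-graph dwarf -a sleep 1
  $ perf report --no-children

  # scheduler autogrouping can skew results on larger machines
  $ cat /proc/sys/kernel/sched_autogroup_enabled
  $ echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled

  # and check whether cgroup.memory=nokmem is already on the kernel command line
  $ cat /proc/cmdline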