Having gotten a little energy boost after meeting some VSP (Very Smart People) at the recent PGConf.EU, I've attempted to pursue an optimization for an apparently minor (?) inefficiency that I spotted while researching extremely high numbers of active backends under extreme max_connections (measured in thousands of active backends, all on the runqueue). The classic explanations are well known: context switching, ProcArrayGroupUpdate, yada yada yada, and it was still fun to investigate. Yet after commit 0e141c0fbb2, ProcArrayGroupClearXid() can end up performing dozens of repetitive calls to PGSemaphoreUnlock()->futex(). In such an extreme situation, each syscall is a context switch and an opportunity for the scheduler to pick a different PID to run on that vCPU - I think I was able to verify this via `perf sched map`, and if my (limited) interpretation of the timestamps is accurate, this has a real chance of occurring. Linux on its own does not have a vectorized futex unlock syscall like "futex_wakev()" (it only has futex_waitv()), yet in [0] I found that futexes were added to the io_uring facility (for kernels >= 6.6), so I thought I'd give it a shot and write my own semaphore implementation with the ability to unlock many semaphores at once in a single io_uring_enter() syscall, notifying all those dozens of waiting backends in one go. Apparently Andres knows Jens :) and mixing io_uring with futexes was their idea first, though for a different use case - if I understood correctly, it is mainly intended for future AIO use.
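To make the batched-wake idea concrete, here is a minimal sketch (not code from the patch) of what a single-syscall multi-wakeup looks like with liburing's futex opcodes. It assumes liburing >= 2.5 and a kernel (plus uapi headers) with io_uring futex support; the helper name batched_futex_wake and the wake_words/nwaiters parameters are made up for illustration, and the ring is assumed to have been set up earlier with io_uring_queue_init():

/*
 * Queue one IORING_OP_FUTEX_WAKE per waiting backend's futex word and
 * submit them all with a single io_uring_enter() syscall.
 */
#include <liburing.h>
#include <linux/futex.h>		/* FUTEX_BITSET_MATCH_ANY, FUTEX2_SIZE_U32 */
#include <stdint.h>

static int
batched_futex_wake(struct io_uring *ring, uint32_t **wake_words, int nwaiters)
{
	for (int i = 0; i < nwaiters; i++)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (sqe == NULL)
			return -1;			/* SQ ring full; a real implementation would flush and retry */

		/* wake at most one waiter sleeping on this futex word */
		io_uring_prep_futex_wake(sqe, wake_words[i], 1,
								 FUTEX_BITSET_MATCH_ANY,
								 FUTEX2_SIZE_U32, 0);
	}

	/*
	 * One io_uring_enter() submits all queued wake requests.  The return
	 * value is the number of SQEs actually submitted - the caller should
	 * compare it against nwaiters and also reap the CQEs to check each
	 * wake's result.
	 */
	return io_uring_submit(ring);
}

(A mismatch between the submit count and nwaiters is the kind of thing that could explain issue 1 mentioned below.)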
Anyway, in this case it turned out to be easier to write an atomic/futex implementation from scratch (and add liburing to the mix), somewhat based on [1], rather than trying to understand what glibc is up to (my first idea was to just hack glibc). I'm not working on this patch any more, as initial performance results show that it is NOT more efficient than the standard POSIX semaphores that glibc provides by default. So I'm sharing it as it is, to avoid a simple `rm -f patch` - maybe someone else has another idea what to change, or in the future it could be a starting base for faster/alternative locking implementations for someone else too.

There are issues with the patch itself that I'm aware of:

1. a critical bug where some backends get stuck in COMMIT (with concurrency >> VCPUs), maybe related to io_uring submitting/acknowledging fewer (!) FUTEX_WAKE ops than it was asked for

2. this PGSemaphore implementation uses just two states for the atomics rather than the more optimal three states ([1] discusses that optimization in greater detail - not sure, maybe it is worth it? see the sketch after the benchmark output below)

Tests on 1s32c64t (AMD EPYC 7551) with:

* shared_buffers = '1GB'
* wal_buffers = '256MB'
* max_connections = 10000
* huge_pages = 'on'
* fsync = off

Master:

postgres@jw-test3:~$ cat commit.sql
SELECT txid_current();
postgres@jw-test3:~$ /usr/pgsql18.master/bin/pgbench -c 1000 -j 1000 -p 5123 -T 20 -n -P 1 -f commit.sql
pgbench (18devel)
[..]
latency average = 4.706 ms
latency stddev = 7.811 ms
initial connection time = 525.923 ms
tps = 211550.518513 (without initial connection time)

This patch:

postgres@jw-test3:~$ /usr/pgsql18.uringsema/bin/pgbench -c 1000 -j 1000 -p 5123 -T 20 -n -P 1 -f commit.sql
pgbench (18devel)
[..]
progress: 6.0 s, 198651.8 tps, lat 5.002 ms stddev 10.030, 0 failed
progress: 7.0 s, 198939.6 tps, lat 4.973 ms stddev 9.923, 0 failed
progress: 8.0 s, 200450.2 tps, lat 4.957 ms stddev 9.768, 0 failed
progress: 9.0 s, 201333.6 tps, lat 4.918 ms stddev 9.651, 0 failed
progress: 10.0 s, 201588.2 tps, lat 4.906 ms stddev 9.612, 0 failed
progress: 11.0 s, 197024.4 tps, lat 5.009 ms stddev 10.152, 0 failed
progress: 12.0 s, 187971.7 tps, lat 5.254 ms stddev 10.754, 0 failed
progress: 13.0 s, 188385.1 tps, lat 5.238 ms stddev 10.789, 0 failed
progress: 14.0 s, 190784.5 tps, lat 5.138 ms stddev 10.331, 0 failed
progress: 15.0 s, 191809.5 tps, lat 5.118 ms stddev 10.253, 0 failed
[..]
progress: 19.0 s, 188017.3 tps, lat 5.210 ms stddev 10.674, 0 failed
progress: 20.0 s, 174381.6 tps, lat 5.329 ms stddev 11.084, 0 failed   <-- bug with the patch: it hangs some sessions in COMMIT / ProcArrayGroupUpdate; there's something that I'm missing :)
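Regarding issue 2 above, here is a minimal sketch (my paraphrase of [1], not code from the patch) of the "two state" futex lock: 0 = unlocked, 1 = locked. The drawback the paper describes is that the release path must always issue a FUTEX_WAKE syscall, because with only two states there is no way to tell whether anyone is actually waiting; the three-state variant (0 = unlocked, 1 = locked, 2 = locked with waiters) lets the common uncontended release skip that syscall. The type and function names here are made up for illustration:

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

static long
sys_futex(uint32_t *addr, int op, uint32_t val)
{
	return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

typedef struct
{
	_Atomic uint32_t val;		/* 0 = free, 1 = held */
} two_state_lock;

static void
two_state_lock_acquire(two_state_lock *l)
{
	uint32_t expected = 0;

	while (!atomic_compare_exchange_strong(&l->val, &expected, 1))
	{
		/* sleep only as long as the word still holds the value we observed */
		sys_futex((uint32_t *) &l->val, FUTEX_WAIT, expected);
		expected = 0;
	}
}

static void
two_state_lock_release(two_state_lock *l)
{
	atomic_store(&l->val, 0);
	/* with only two states we have to pay for this syscall on every release */
	sys_futex((uint32_t *) &l->val, FUTEX_WAKE, 1);
}

As noted in issue 2, I'm not sure whether the three-state optimization would actually pay off here.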
If someone wants to bootstrap this:

apt-get install 'linux-image-6.10.11+bpo-cloud-amd64'
git clone https://github.com/axboe/liburing.git && cd liburing && ./configure && make install
cd /git/postgres
git checkout master && git pull  # or 5b0c46ea0932e3be64081a277b5cc01fa9571689
# uses the already existing patch that brings the liburing dependency to PG:
# https://www.postgresql.org/message-id/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
wget https://www.postgresql.org/message-id/attachment/164657/v2.0-0006-aio-Add-liburing-dependency.patch
patch -p1 < v2.0-0006-aio-Add-liburing-dependency.patch
patch -p1 < uring_sema_patch_v1.diff  # this patch
meson setup build --prefix=/usr/pgsql18.uringsema
cd build ; sudo ninja install
/usr/pgsql18.uringsema/bin/pg_ctl -D /db/data -l logfile -m immediate stop; rm -rf /db/data/* ; /usr/pgsql18.uringsema/bin/initdb -D /db/data ; cp ~/postgresql.auto.conf /db/data; rm -f ~/logfile; /usr/pgsql18.uringsema/bin/pg_ctl -D /db/data -l logfile start

In both cases the top waits are always XidGenLock (~800 backends) + ProcArrayGroupUpdate (~150).

-J.

[0] - https://lwn.net/Articles/945891/
[1] - https://people.redhat.com/drepper/futex.pdf
Attachment: uring_sema_patch_v1.diff