Having gotten a little energy boost after meeting some VSP (Very Smart People) at the recent PGConf.EU, I've attempted to pursue an optimization for an apparently minor (?) inefficiency that I spotted while researching extremely high numbers of active backends under extreme max_connections (measured in thousands of active backends, all on the runqueue). The classic explanations are well known: context switching, ProcArrayGroupUpdate, yada yada yada, and it was still fun to investigate. Yet after commit 0e141c0fbb2, ProcArrayGroupClearXid() can end up performing dozens of repetitive calls to PGSemaphoreUnlock()->futex(). In such an extreme situation, each syscall is a context switch and an opportunity for the scheduler to pick a different PID to run on that vCPU - I think I was able to verify this via `perf sched map`, and if my (limited) interpretation of the timestamps is accurate, this has a real chance of occurring. Linux on its own does not have a vectorized futex unlock syscall like "futex_wakev()" (it only has futex_waitv()), yet in [0] I found that futexes were added to the io_uring facility (for kernels >= 6.6), so I thought I'd give it a shot and write my own semaphore implementation with the ability to unlock many semaphores at once in a single io_uring_enter() syscall, notifying all those dozens of waiting backends in one go. Apparently Andres knows Jens :) and mixing io_uring with futexes was their idea first, though for a different use case - if I understood correctly, it is mainly intended for future AIO use.
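To make the batched-wake idea concrete, here is a minimal sketch (not code from the patch) of what a single-syscall multi-wakeup looks like with liburing's futex opcodes. It assumes liburing >= 2.5 and a kernel (plus uapi headers) with io_uring futex support; the helper name batched_futex_wake and the wake_words/nwaiters parameters are made up for illustration, and the ring is assumed to have been set up earlier with io_uring_queue_init():

/*
 * Queue one IORING_OP_FUTEX_WAKE per waiting backend's futex word and
 * submit them all with a single io_uring_enter() syscall.
 */
#include <liburing.h>
#include <linux/futex.h>		/* FUTEX_BITSET_MATCH_ANY, FUTEX2_SIZE_U32 */
#include <stdint.h>

static int
batched_futex_wake(struct io_uring *ring, uint32_t **wake_words, int nwaiters)
{
	for (int i = 0; i < nwaiters; i++)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (sqe == NULL)
			return -1;			/* SQ ring full; a real implementation would flush and retry */

		/* wake at most one waiter sleeping on this futex word */
		io_uring_prep_futex_wake(sqe, wake_words[i], 1,
								 FUTEX_BITSET_MATCH_ANY,
								 FUTEX2_SIZE_U32, 0);
	}

	/*
	 * One io_uring_enter() submits all queued wake requests.  The return
	 * value is the number of SQEs actually submitted - the caller should
	 * compare it against nwaiters and also reap the CQEs to check each
	 * wake's result.
	 */
	return io_uring_submit(ring);
}

(A mismatch between the submit count and nwaiters is the kind of thing that could explain issue 1 mentioned below.)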
Anyway, in this case it turned out to be easier to write an atomic/futex implementation from scratch (and add liburing to the mix), somewhat based on [1], rather than trying to understand what glibc is up to (my first idea was to just hack glibc). I'm not working on this patch any more, as initial performance results show that it is NOT more efficient than the standard POSIX semaphores that glibc provides by default. So I'm sharing it as it is, to avoid a simple `rm -f patch` - maybe someone else has another idea what to change, or in the future it could be a starting base for faster/alternative locking implementations for someone else too.

There are issues with the patch itself that I'm aware of:

1. a critical bug where some backends get stuck in COMMIT (with concurrency >> VCPUs), maybe related to io_uring submitting/acknowledging fewer (!) FUTEX_WAKE ops than it was asked for

2. this PGSemaphore implementation uses just two states for the atomics rather than the more optimal three states ([1] discusses that optimization in greater detail - not sure, maybe it is worth it? see the sketch after the benchmark output below)

Tests on 1s32c64t (AMD EPYC 7551) with:

* shared_buffers = '1GB'
* wal_buffers = '256MB'
* max_connections = 10000
* huge_pages = 'on'
* fsync = off

Master:

postgres@jw-test3:~$ cat commit.sql
SELECT txid_current();
postgres@jw-test3:~$ /usr/pgsql18.master/bin/pgbench -c 1000 -j 1000 -p 5123 -T 20 -n -P 1 -f commit.sql
pgbench (18devel)
[..]
latency average = 4.706 ms
latency stddev = 7.811 ms
initial connection time = 525.923 ms
tps = 211550.518513 (without initial connection time)

This patch:

postgres@jw-test3:~$ /usr/pgsql18.uringsema/bin/pgbench -c 1000 -j 1000 -p 5123 -T 20 -n -P 1 -f commit.sql
pgbench (18devel)
[..]
progress: 6.0 s, 198651.8 tps, lat 5.002 ms stddev 10.030, 0 failed
progress: 7.0 s, 198939.6 tps, lat 4.973 ms stddev 9.923, 0 failed
progress: 8.0 s, 200450.2 tps, lat 4.957 ms stddev 9.768, 0 failed
progress: 9.0 s, 201333.6 tps, lat 4.918 ms stddev 9.651, 0 failed
progress: 10.0 s, 201588.2 tps, lat 4.906 ms stddev 9.612, 0 failed
progress: 11.0 s, 197024.4 tps, lat 5.009 ms stddev 10.152, 0 failed
progress: 12.0 s, 187971.7 tps, lat 5.254 ms stddev 10.754, 0 failed
progress: 13.0 s, 188385.1 tps, lat 5.238 ms stddev 10.789, 0 failed
progress: 14.0 s, 190784.5 tps, lat 5.138 ms stddev 10.331, 0 failed
progress: 15.0 s, 191809.5 tps, lat 5.118 ms stddev 10.253, 0 failed
[..]
progress: 19.0 s, 188017.3 tps, lat 5.210 ms stddev 10.674, 0 failed
progress: 20.0 s, 174381.6 tps, lat 5.329 ms stddev 11.084, 0 failed   <-- bug with the patch: it hangs some sessions in COMMIT / ProcArrayGroupUpdate; there's something that I'm missing :)
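Regarding issue 2 above, here is a minimal sketch (my paraphrase of [1], not code from the patch) of the "two state" futex lock: 0 = unlocked, 1 = locked. The drawback the paper describes is that the release path must always issue a FUTEX_WAKE syscall, because with only two states there is no way to tell whether anyone is actually waiting; the three-state variant (0 = unlocked, 1 = locked, 2 = locked with waiters) lets the common uncontended release skip that syscall. The type and function names here are made up for illustration:

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

static long
sys_futex(uint32_t *addr, int op, uint32_t val)
{
	return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

typedef struct
{
	_Atomic uint32_t val;		/* 0 = free, 1 = held */
} two_state_lock;

static void
two_state_lock_acquire(two_state_lock *l)
{
	uint32_t expected = 0;

	while (!atomic_compare_exchange_strong(&l->val, &expected, 1))
	{
		/* sleep only as long as the word still holds the value we observed */
		sys_futex((uint32_t *) &l->val, FUTEX_WAIT, expected);
		expected = 0;
	}
}

static void
two_state_lock_release(two_state_lock *l)
{
	atomic_store(&l->val, 0);
	/* with only two states we have to pay for this syscall on every release */
	sys_futex((uint32_t *) &l->val, FUTEX_WAKE, 1);
}

As noted in issue 2, I'm not sure whether the three-state optimization would actually pay off here.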
If someone wants to bootstrap this:

apt-get install 'linux-image-6.10.11+bpo-cloud-amd64'
git clone https://github.com/axboe/liburing.git && cd liburing && ./configure && make install
cd /git/postgres
git checkout master && git pull  # or 5b0c46ea0932e3be64081a277b5cc01fa9571689
# uses the already existing patch that brings the liburing dependency to PG:
# https://www.postgresql.org/message-id/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
wget https://www.postgresql.org/message-id/attachment/164657/v2.0-0006-aio-Add-liburing-dependency.patch
patch -p1 < v2.0-0006-aio-Add-liburing-dependency.patch
patch -p1 < uring_sema_patch_v1.diff  # this patch
meson setup build --prefix=/usr/pgsql18.uringsema
cd build ; sudo ninja install
/usr/pgsql18.uringsema/bin/pg_ctl -D /db/data -l logfile -m immediate stop; rm -rf /db/data/* ; /usr/pgsql18.uringsema/bin/initdb -D /db/data ; cp ~/postgresql.auto.conf /db/data; rm -f ~/logfile; /usr/pgsql18.uringsema/bin/pg_ctl -D /db/data -l logfile start

In both cases the top waits are always XidGenLock (~800 backends) + ProcArrayGroupUpdate (~150).

-J.

[0] - https://lwn.net/Articles/945891/
[1] - https://people.redhat.com/drepper/futex.pdf
Attachment: uring_sema_patch_v1.diff