Hoi hackers,

Please find attached updated versions of the patches; I've now tested
them. Also attached is a reproduction script to verify that they
actually work.
To test you need to create 150 databases as described in the script,
then simply execute it. Before patching you get the following results
(the last figure is the CPU usage of Postgres):

1559749330 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 269.07%
1559749335 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 268.07%
1559749340 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 270.94%

After patching you get the following:

1559749840 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.02, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.09%
1559749845 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.06%
1559749850 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 4.75%

The async queue functions in Postgres also no longer appear in the
perf output (they are below the measuring threshold).

As for the general method: it seems the actual optimisation here is
that the async queue tail pointer is only updated once per SLRU page
instead of on every message. Making that structure explicit would
require a significantly larger patch, but it shouldn't be too
difficult; the crux is the page check sketched below.
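In guard form (this is essentially the check that patch 1 adds at the
top of asyncQueueAdvanceTail(), shown standalone here):

/*
 * Truncation happens at SLRU page granularity, so unless the head
 * has left the tail's page there is nothing we could usefully do.
 */
if (QUEUE_POS_PAGE(QUEUE_TAIL) == QUEUE_POS_PAGE(QUEUE_HEAD))
    return;

A fuller version of the idea would presumably also remember whether an
advance has already been attempted for the current tail page, which is
where the significantly larger patch would come in.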
Thoughts?

Have a nice day,
Martijn

On Tue, 4 Jun 2019 at 09:08, Martijn van Oosterhout <klep...@gmail.com> wrote:
>
> Hoi hackers,
>
> We've been having issues with NOTIFYs blocking over multiple databases
> (see [1] for more details). That was on 9.4, but we've since updated
> the database to 11.3 and still have the same issue. Now, however, we
> could use perf to do profiling and got the following profile (useless
> details elided):
>
>   --32.83%--ProcessClientReadInterrupt
>      --32.68%--ProcessNotifyInterrupt
>         --32.16%--asyncQueueReadAllNotifications
>            --23.37%--asyncQueueAdvanceTail
>               --20.49%--LWLockAcquire
>                  --18.93%--LWLockQueueSelf
>                     --12.99%--LWLockWaitListLock
>
> (from: perf record -F 99 -ag -- sleep 600)
>
> That shows that more than 20% of the time is spent in that single
> function, waiting for an exclusive lock on the AsyncQueueLock. This
> will block any concurrent session doing a NOTIFY in any database on
> the system, which would certainly explain the symptoms we're seeing
> ("process xxx still waiting for AccessExclusiveLock on object 0 of
> class 1262 of database 0").
>
> Analysis of the code leads me to the following hypothesis (and hence
> to the attached patches):
>
> We have ~150 databases, each of which has 2 active backends with an
> active LISTEN. When a NOTIFY happens anywhere on any database, it
> (under an exclusive lock) makes a list of 300 backends to send a
> signal to. It then wakes up all of those backends. Each backend then
> examines the message, and all but one discard it as being for the
> wrong database. Each backend then calls asyncQueueAdvanceTail
> (because the current position of each backend was the tail), which
> takes an exclusive lock and checks all the other backends to see if
> the tail can be advanced. All of these conclude 'no', except the very
> last one, which concludes the tail can be advanced by about 50 bytes
> or so.
>
> So the inner loop of asyncQueueAdvanceTail will, while holding a
> global exclusive lock, execute 2*150 (woken backends) * 4000 (max
> backends) = 1.2 million times for basically no benefit. During this
> time, no other transaction anywhere in the system that does a NOTIFY
> will be able to commit.
>
> The attached patches attempt to reduce the overhead in two ways:
>
> Patch 1: Changes asyncQueueAdvanceTail to do nothing unless the
> QUEUE_HEAD is on a different page than the QUEUE_TAIL. The idea is
> that there's no point trying to advance the tail unless we can
> actually usefully truncate the SLRU. This does however mean that
> asyncQueueReadAllNotifications always has to call
> asyncQueueAdvanceTail, since it can no longer be guaranteed that any
> backend is still at the tail, which is one of the assumptions of the
> current code. Not sure if this is a problem or if it can be improved
> without tracking much more state.
>
> Patch 2: Changes SignalBackends to only notify other backends when (a)
> they're in the same database as me or (b) the notify queue has
> advanced to a new SLRU page. This avoids backends being woken up for
> messages they are not interested in.
>
> As a consequence of these changes, we can reduce the number of
> exclusive locks and backend wake-ups in our case by a factor of 300.
> You still, however, get a thundering herd at the end of each SLRU
> page.
>
> Note: these patches have not yet been extensively tested, and so
> should only be used as a basis for discussion.
>
> Comments? Suggestions?
>
> [1] https://www.postgresql.org/message-id/cadwg95t0j9zf0uwdcmh81kmndsitavhxmbvgyqrrjcd-ilw...@mail.gmail.com
>
> --
> Martijn van Oosterhout <klep...@gmail.com> http://svana.org/kleptog/

--
Martijn van Oosterhout <klep...@gmail.com> http://svana.org/kleptog/
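P.S. For context, each of the ~300 listening backends described above
corresponds to a client doing roughly the following (a minimal libpq
sketch, not our actual application; the connection string is made up
and error handling is abbreviated):

#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

int
main(void)
{
    /* dbname is hypothetical; the test script uses test_000..test_149 */
    PGconn     *conn = PQconnectdb("dbname=test_000 port=5432");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(PQexec(conn, "LISTEN changes"));

    for (;;)
    {
        int         sock = PQsocket(conn);
        fd_set      mask;
        PGnotify   *n;

        /* sleep until the server sends something on our socket */
        FD_ZERO(&mask);
        FD_SET(sock, &mask);
        if (select(sock + 1, &mask, NULL, NULL, NULL) < 0)
            break;

        if (!PQconsumeInput(conn))
            break;
        while ((n = PQnotifies(conn)) != NULL)
        {
            printf("pid %d: %s '%s'\n", n->be_pid, n->relname, n->extra);
            PQfreemem(n);
        }
    }
    PQfinish(conn);
    return 0;
}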
From 67cc65b0c98d8282602b414d0ffaff158bf887f7 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterh...@fox-it.com>
Date: Tue, 4 Jun 2019 16:14:31 +0200
Subject: [PATCH 3/3] Quickly bail if queue tail can't be moved.

---
 src/backend/commands/async.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 18bd3e975e..f0a5a2472f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2022,7 +2022,11 @@ asyncQueueAdvanceTail(void)
 	for (i = 1; i <= MaxBackends; i++)
 	{
 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+		{
 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+			if (QUEUE_POS_EQUAL(min, QUEUE_TAIL))
+				break;
+		}
 	}
 	oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
 	QUEUE_TAIL = min;
--
2.11.0
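The point of the early exit: min only ever decreases during the scan
and can never go below QUEUE_TAIL, so once it has reached the tail no
remaining backend can change the result. With hundreds of backends
parked exactly at the tail the loop now stops at the first of them
instead of visiting all MaxBackends slots. A self-contained toy model
of the changed loop (plain ints instead of QueuePosition, all names
invented here):

#include <stdio.h>

/*
 * Toy model of the scan in asyncQueueAdvanceTail(): positions are
 * plain integers, -1 marks an unused slot (the InvalidPid case).
 */
static int
min_backend_pos(const int *pos, int nbackends, int head, int tail)
{
    int         min = head;

    for (int i = 0; i < nbackends; i++)
    {
        if (pos[i] != -1)
        {
            if (pos[i] < min)
                min = pos[i];
            if (min == tail)
                break;          /* nothing can be lower than the tail */
        }
    }
    return min;
}

int
main(void)
{
    int         pos[] = {40, 10, -1, 55, 70};

    /* head = 100, tail = 10: we stop at slot 1 instead of slot 4 */
    printf("new tail candidate: %d\n", min_backend_pos(pos, 5, 100, 10));
    return 0;
}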
From 385f3729beb9e1e4bcf77e1428925a76e3300fdd Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterh...@fox-it.com>
Date: Mon, 3 Jun 2019 17:10:38 +0200
Subject: [PATCH 2/3] Don't notify other backends about notifications without
 cause.

It could be that there are other databases with active listens but
unless they need to know because it may be useful to advance the queue
tail, there's no point waking them up.
---
 src/backend/commands/async.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 53e28fe777..18bd3e975e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1519,11 +1519,13 @@ SignalBackends(void)
 	int			count;
 	int			i;
 	int32		pid;
+	int			notify_all = false;
 
 	/*
 	 * Identify all backends that are listening and not already up-to-date. We
 	 * don't want to send signals while holding the AsyncQueueLock, so we just
-	 * build a list of target PIDs.
+	 * build a list of target PIDs. If we haven't moved to a new page there is
+	 * no point notifying backends of other databases.
 	 *
 	 * XXX in principle these pallocs could fail, which would be bad. Maybe
 	 * preallocate the arrays? But in practice this is only run in trivial
@@ -1532,6 +1534,7 @@
 	pids = (int32 *) palloc(MaxBackends * sizeof(int32));
 	ids = (BackendId *) palloc(MaxBackends * sizeof(BackendId));
 	count = 0;
+	notify_all = QUEUE_POS_PAGE(QUEUE_HEAD) != QUEUE_POS_PAGE(QUEUE_TAIL);
 
 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
 	for (i = 1; i <= MaxBackends; i++)
@@ -1541,6 +1544,9 @@
 		{
 			QueuePosition pos = QUEUE_BACKEND_POS(i);
 
+			if (!notify_all && QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+				continue;
+
 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
 			{
 				pids[count] = pid;
--
2.11.0
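Patch 2/3's filter, restated as a standalone predicate (a toy
restatement; the names here are made up, and in the real patch this is
an inline test in the SignalBackends() loop):

#include <stdbool.h>

typedef unsigned int Oid;       /* stand-in for the real typedef */

/*
 * A listening backend is worth signalling if it is in our database, or
 * if the head has moved onto a new SLRU page, in which case waking it
 * may help advance the tail.
 */
bool
worth_signalling(Oid backend_db, Oid my_db, int head_page, int tail_page)
{
    bool        notify_all = (head_page != tail_page);

    return notify_all || backend_db == my_db;
}

Note how this complements patch 1: while head and tail share a page the
tail cannot usefully advance anyway, so in the common case only
same-database backends are woken, and the cross-database wake-ups are
concentrated at page boundaries (the thundering herd mentioned above).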
From c0db16ec3bb971fb41b9e7f568b5cb168aa462ba Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterh...@fox-it.com>
Date: Mon, 3 Jun 2019 17:13:31 +0200
Subject: [PATCH 1/3] Only try advancing tail pointer when it's useful.

Advancing the tail pointer requires an exclusive lock which can block
backends from other databases, so it's worth keeping these attempts to
a minimum.
---
 src/backend/commands/async.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 738e6ec7e2..53e28fe777 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -89,11 +89,9 @@
  * Inbound-notify processing consists of reading all of the notifications
  * that have arrived since scanning last time. We read every notification
  * until we reach either a notification from an uncommitted transaction or
- * the head pointer's position. Then we check if we were the laziest
- * backend: if our pointer is set to the same position as the global tail
- * pointer is set, then we move the global tail pointer ahead to where the
- * second-laziest backend is (in general, we take the MIN of the current
- * head position and all active backends' new tail pointers). Whenever we
+ * the head pointer's position. Then we check if we can advance the global
+ * tail pointer to a new page, if so we take the MIN of the current
+ * head position and all active backends' new tail pointers. Whenever we
  * move the global tail pointer we also truncate now-unused pages (i.e.,
  * delete files in pg_notify/ that are no longer used).
  *
@@ -1749,7 +1747,6 @@ asyncQueueReadAllNotifications(void)
 	QueuePosition oldpos;
 	QueuePosition head;
 	Snapshot	snapshot;
-	bool		advanceTail;
 
 	/* page_buffer must be adequately aligned, so use a union */
 	union
@@ -1871,12 +1868,10 @@
 		/* Update shared state */
 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
 		QUEUE_BACKEND_POS(MyBackendId) = pos;
-		advanceTail = QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL);
 		LWLockRelease(AsyncQueueLock);
 
-		/* If we were the laziest backend, try to advance the tail pointer */
-		if (advanceTail)
-			asyncQueueAdvanceTail();
+		/* Check if tail can be advanced */
+		asyncQueueAdvanceTail();
 
 		PG_RE_THROW();
 	}
@@ -1885,12 +1880,10 @@
 	/* Update shared state */
 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
 	QUEUE_BACKEND_POS(MyBackendId) = pos;
-	advanceTail = QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL);
 	LWLockRelease(AsyncQueueLock);
 
-	/* If we were the laziest backend, try to advance the tail pointer */
-	if (advanceTail)
-		asyncQueueAdvanceTail();
+	/* Check if tail can be advanced */
+	asyncQueueAdvanceTail();
 
 	/* Done with snapshot */
 	UnregisterSnapshot(snapshot);
@@ -2010,6 +2003,14 @@ asyncQueueAdvanceTail(void)
 	int			newtailpage;
 	int			boundary;
 
+	/*
+	 * Advancing the tail is expensive (it takes an exclusive lock which
+	 * can block committing backends) so don't bother if we can't advance
+	 * at least a page.
+	 */
+	if (QUEUE_POS_PAGE(QUEUE_TAIL) == QUEUE_POS_PAGE(QUEUE_HEAD))
+		return;
+
 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
 	min = QUEUE_HEAD;
 	for (i = 1; i <= MaxBackends; i++)
--
2.11.0
import math
import random
import select
import time

import psycopg2

# To use this test you need to create 150 test databases like so:
#   for i in $(seq -w 0 149) ; do createdb --cluster 11/main test_${i} ; done
# For CPU usage it assumes PostgreSQL is running under systemd
# (cgroup v1 layout).

PORT = 5432
NUM_DATABASES = 150
NUM_PER_DATABASE = 2
NOTIFY_PER_SECOND = 100


def get_usage():
    """Return cumulative CPU usage (in ns) of the PostgreSQL cgroup."""
    with open("/sys/fs/cgroup/cpu/system.slice/system-postgresql.slice/cpuacct.usage") as f:
        return int(f.read())


def perc(d, p):
    """Linearly interpolated percentile p (0..1) of the sorted list d."""
    pos = (len(d) - 1) * p
    pos_low = d[int(math.floor(pos))] * (1 - math.modf(pos)[0])
    pos_high = d[int(math.ceil(pos))] * math.modf(pos)[0]
    return pos_low + pos_high


# Open NUM_PER_DATABASE connections to each database and LISTEN on each.
CONNS = {}
REV_CONNS = {}
poller = select.poll()

for i in range(NUM_DATABASES * NUM_PER_DATABASE):
    CONNS[i] = psycopg2.connect("dbname=test_%03d port=%s"
                                % (int(i / NUM_PER_DATABASE), PORT))
    CONNS[i].set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    REV_CONNS[CONNS[i].fileno()] = i

    cur = CONNS[i].cursor()
    cur.execute("LISTEN changes")
    poller.register(CONNS[i], select.POLLIN)

counter = 0
next_time = time.time()
last_update = int(next_time / 5) * 5
stat_sent = 0
stat_recv = 0
stat_delays = []
stat_usage = get_usage()

while True:
    t = time.time()

    # Send NOTIFYs at the configured rate, on a random connection each time.
    while t > next_time:
        r = random.randrange(NUM_DATABASES * NUM_PER_DATABASE)
        CONNS[r].cursor().execute("NOTIFY changes, '%s'" % counter)
        # print("%.2f NOTIFY on %d" % (t, r))
        counter += 1
        stat_sent += 1
        next_time += 1.0 / NOTIFY_PER_SECOND
        t = time.time()

        # Process self-notify
        while CONNS[r].notifies:
            stat_recv += 1
            notify = CONNS[r].notifies.pop(0)
            # print("%.2f Got NOTIFY on %d:" % (t, r), notify.pid, notify.channel, notify.payload)
            stat_delays.append(float(counter - int(notify.payload)) / NOTIFY_PER_SECOND)

    # Collect notifications from the other connections until the next send.
    while t < next_time:
        r = poller.poll(1000 * (next_time - t))
        for fd, _event in r:
            c = CONNS[REV_CONNS[fd]]
            c.poll()
            while c.notifies:
                stat_recv += 1
                notify = c.notifies.pop(0)
                # print("%.2f Got NOTIFY on %d:" % (t, REV_CONNS[fd]), notify.pid, notify.channel, notify.payload)
                stat_delays.append(float(counter - int(notify.payload)) / NOTIFY_PER_SECOND)
        t = time.time()

    # Print statistics every five seconds.
    if t - last_update > 5:
        stat_delays.sort()
        new_usage = get_usage()
        pct_cpu = (new_usage - stat_usage) / (t - last_update) / 1000000000
        print("%d Sent: %d, Recv: %d, Delays: Min: %.2f, Max: %.2f, Avg: %.2f "
              "[%.2f/%.2f/%.2f/%.2f/%.2f], %.2f%%"
              % (last_update, stat_sent, stat_recv,
                 min(stat_delays), max(stat_delays),
                 sum(stat_delays) / len(stat_delays),
                 perc(stat_delays, 0.05), perc(stat_delays, 0.25),
                 perc(stat_delays, 0.5), perc(stat_delays, 0.75),
                 perc(stat_delays, 0.95), pct_cpu * 100))
        stat_sent = 0
        stat_recv = 0
        stat_delays = []
        last_update += 5
        stat_usage = new_usage