Hi, I am working on posting a patch series making relation extension more scalable. As part of that I was running some benchmarks for workloads that I thought should not or just positively impacted - but I was wrong, there was some very significant degradation at very high client counts. After pulling my hair out for quite a while to try to understand that behaviour, I figured out that it's just a side-effect of *removing* some other contention. This morning, turns out sleeping helps, I managed to reproduce it in an unmodified postgres.
$ cat ~/tmp/txid.sql SELECT txid_current(); $ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done 1 60174 2 116169 4 208119 8 373685 16 515247 32 554726 64 497508 128 415097 256 334923 512 243679 768 192959 1024 157734 2048 82904 4096 32007 (I didn't properly round TPS, but that doesn't matter here) Performance completely falls off a cliff starting at ~256 clients. There's actually plenty CPU available here, so this isn't a case of running out of CPU time. Rather, the problem is very bad contention on the "spinlock" for the lwlock wait list. I realized that something in that direction was off when trying to investigate why I was seeing spin delays of substantial duration (>100ms). The problem isn't a fundamental issue with lwlocks, it's that LWLockDequeueSelf() does this: LWLockWaitListLock(lock); /* * Can't just remove ourselves from the list, but we need to iterate over * all entries as somebody else could have dequeued us. */ proclist_foreach_modify(iter, &lock->waiters, lwWaitLink) { if (iter.cur == MyProc->pgprocno) { found = true; proclist_delete(&lock->waiters, iter.cur, lwWaitLink); break; } } I.e. it iterates over the whole waitlist to "find itself". The longer the waitlist gets, the longer this takes. And the longer it takes for LWLockWakeup() to actually wake up all waiters, the more likely it becomes that LWLockDequeueSelf() needs to be called. We can't make the trivial optimization and use proclist_contains(), because PGPROC->lwWaitLink is also used for the list of processes to wake up in LWLockWakeup(). But I think we can solve that fairly reasonably nonetheless. We can change PGPROC->lwWaiting to not just be a boolean, but have three states: 0: not waiting 1: waiting in waitlist 2: waiting to be woken up which we then can use in LWLockDequeueSelf() to only remove ourselves from the list if we're on it. As removal from that list is protected by the wait list lock, there's no race to worry about. client patched HEAD 1 60109 60174 2 112694 116169 4 214287 208119 8 377459 373685 16 524132 515247 32 565772 554726 64 587716 497508 128 581297 415097 256 550296 334923 512 486207 243679 768 449673 192959 1024 410836 157734 2048 326224 82904 4096 250252 32007 Not perfect with the patch, but not awful either. I suspect this issue might actually explain quite a few odd performance behaviours we've seen at the larger end in the past. I think it has gotten a bit worse with the conversion of lwlock.c to proclists (I see lots of expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely exists at least as far back as ab5194e6f61, in 9.5. I guess there's an argument for considering this a bug that we should backpatch a fix for? But given the vintage, probably not? The only thing that gives me pause is that this is quite hard to pinpoint as happening. I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc, but I wanted to get this out to discuss before spending further time. Greetings, Andres Freund
diff --git i/src/include/storage/proc.h w/src/include/storage/proc.h index 8d096fdeeb1..9a2615666a1 100644 --- i/src/include/storage/proc.h +++ w/src/include/storage/proc.h @@ -217,7 +217,8 @@ struct PGPROC bool recoveryConflictPending; /* Info about LWLock the process is currently waiting for, if any. */ - bool lwWaiting; /* true if waiting for an LW lock */ + int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if + * waiting to be woken */ uint8 lwWaitMode; /* lwlock mode being waited for */ proclist_node lwWaitLink; /* position in LW lock wait list */ diff --git i/src/backend/storage/lmgr/lwlock.c w/src/backend/storage/lmgr/lwlock.c index d274c9b1dc9..ffb852c91d7 100644 --- i/src/backend/storage/lmgr/lwlock.c +++ w/src/backend/storage/lmgr/lwlock.c @@ -987,6 +987,9 @@ LWLockWakeup(LWLock *lock) wokeup_somebody = true; } + /* signal that the process isn't on the wait list anymore */ + waiter->lwWaiting = 2; + /* * Once we've woken up an exclusive lock, there's no point in waking * up anybody else. @@ -1044,7 +1047,7 @@ LWLockWakeup(LWLock *lock) * another lock. */ pg_write_barrier(); - waiter->lwWaiting = false; + waiter->lwWaiting = 0; PGSemaphoreUnlock(waiter->sem); } } @@ -1073,7 +1076,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode) /* setting the flag is protected by the spinlock */ pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS); - MyProc->lwWaiting = true; + MyProc->lwWaiting = 1; MyProc->lwWaitMode = mode; /* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */ @@ -1101,7 +1104,6 @@ static void LWLockDequeueSelf(LWLock *lock) { bool found = false; - proclist_mutable_iter iter; #ifdef LWLOCK_STATS lwlock_stats *lwstats; @@ -1114,17 +1116,14 @@ LWLockDequeueSelf(LWLock *lock) LWLockWaitListLock(lock); /* - * Can't just remove ourselves from the list, but we need to iterate over - * all entries as somebody else could have dequeued us. + * Remove ourselves from the waitlist, unless we've already been + * removed. The removal happens with the wait list lock held, so there's + * no race in this check. */ - proclist_foreach_modify(iter, &lock->waiters, lwWaitLink) + if (MyProc->lwWaiting == 1) { - if (iter.cur == MyProc->pgprocno) - { - found = true; - proclist_delete(&lock->waiters, iter.cur, lwWaitLink); - break; - } + proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink); + found = true; } if (proclist_is_empty(&lock->waiters) && @@ -1138,7 +1137,7 @@ LWLockDequeueSelf(LWLock *lock) /* clear waiting state again, nice for debugging */ if (found) - MyProc->lwWaiting = false; + MyProc->lwWaiting = 0; else { int extraWaits = 0;