Postgres stucks in deadlock detection

Konstantin Knizhnik Wed, 04 Apr 2018 01:54:57 -0700

Hi hackers,
Please note that this is not an April Fools' joke ;)

Several times we and our customers have suffered from Postgres getting stuck in deadlock detection. One scenario is a YCSB workload with a Zipf distribution, where many clients try to update the same record and compete for its lock. Another scenario is a large number of clients performing inserts into the same table. In this case the source of the problem is the relation extension lock.

In both cases the number of connections is large: several hundred.

So what's happening? Due to high contention, backends are not able to obtain the requested lock within the deadlock detection timeout (1 second by default). The wait is interrupted by the timeout and the backend tries to perform deadlock detection. CheckDeadLock takes an exclusive lock on all lock partitions... An avalanche of deadlock timeout expirations in the backends, and their contention for the exclusive partition locks, causes Postgres to get stuck. Throughput falls almost to zero and it is not even possible to log in to Postgres.

It is a well-known fact that Postgres does not scale well to such a large number of connections, and that it is necessary to use pgbouncer or some other connection pooler to limit the number of backends. But modern systems have hundreds of CPU cores, and to utilize all these resources we need hundreds of active backends. So this is not an artificial problem but a real show stopper, which occurs on real workloads.


There are several ways to solve this problem.

The first is trivial: increase the deadlock detection timeout. In the YCSB case it helps. But in the case of many concurrent inserts, some backends wait for the lock for several minutes, so there is no realistic value of the deadlock detection timeout that completely solves the problem. Also, significantly increasing the deadlock detection timeout may block applications for an unacceptable amount of time when a real deadlock occurs.

There is a patch in the commitfest proposed by Yury Sokolov: https://commitfest.postgresql.org/18/1602/
He performs the deadlock check in two phases: first under a shared lock and then under an exclusive lock.

I am proposing a much simpler patch (attached), which uses an atomic flag to prevent concurrent deadlock detection by more than one backend. The obvious drawback of this solution is that detecting unrelated deadlock loops may take longer. But a deadlock is an abnormal situation in any case, and I do not know of applications that consider deadlocks normal behavior. Also, I have never seen a situation in which more than one independent deadlock happens at the same time (but obviously it is possible).
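
To illustrate the idea outside of the PostgreSQL source tree, here is a minimal standalone sketch of such a guard. It uses C11 `<stdatomic.h>` rather than PostgreSQL's `pg_atomic_flag`, and the names `active_check`, `try_begin_deadlock_check` and `end_deadlock_check` are hypothetical, chosen only for this example. Note that C11's `atomic_flag_test_and_set` returns the previous value of the flag, so the result is negated to mean "I acquired it".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared "deadlock check in progress" flag. */
static atomic_flag active_check = ATOMIC_FLAG_INIT;

/*
 * Try to become the only active checker.
 * Returns true if the caller set the flag (it was clear before),
 * false if another checker already holds it.
 */
bool
try_begin_deadlock_check(void)
{
	/* atomic_flag_test_and_set returns the PREVIOUS value of the flag. */
	return !atomic_flag_test_and_set(&active_check);
}

/* Release the flag so that another caller may run a check. */
void
end_deadlock_check(void)
{
	atomic_flag_clear(&active_check);
}
```

A caller that gets `false` simply skips the check and goes back to waiting; a caller that gets `true` must call `end_deadlock_check()` on every exit path, including error paths.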


So, I see three possible ways to fix this problem:
1. Yury Sokolov's patch with the two-phase deadlock check
2. Avoid concurrent deadlock detection
3. Avoid concurrent deadlock detection + let CheckDeadLock detect all deadlocks, not only those in which the current transaction is involved.
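
To make the third option more concrete, here is a rough sketch of what "detect all deadlocks" could mean on a deliberately simplified model. It assumes each process waits for at most one other process, which is much simpler than the real lock manager's wait-for graph; the array `waits_for`, the function `find_all_deadlocked` and the `MAX_PROCS` limit are hypothetical names for this example only.

```c
#include <stdbool.h>

#define NO_WAIT   (-1)
#define MAX_PROCS 1024

/*
 * Simplified model (an assumption for illustration): process i waits for
 * process waits_for[i], or NO_WAIT if it is not blocked.
 * Sets in_deadlock[i] for every process that lies on a wait cycle and
 * returns the number of such processes, or -1 if nprocs is too large.
 */
int
find_all_deadlocked(const int *waits_for, int nprocs, bool *in_deadlock)
{
	/* 0 = unvisited, 1 = on the current walk, 2 = finished */
	unsigned char state[MAX_PROCS] = {0};
	int			count = 0;

	if (nprocs > MAX_PROCS)
		return -1;

	for (int i = 0; i < nprocs; i++)
		in_deadlock[i] = false;

	for (int start = 0; start < nprocs; start++)
	{
		int			cur = start;

		/* Follow wait edges until the chain ends or reaches a seen process. */
		while (cur != NO_WAIT && state[cur] == 0)
		{
			state[cur] = 1;
			cur = waits_for[cur];
		}

		/* Reaching a process on the current walk means a cycle was closed. */
		if (cur != NO_WAIT && state[cur] == 1)
		{
			int			p = cur;

			do
			{
				in_deadlock[p] = true;
				count++;
				p = waits_for[p];
			} while (p != cur);
		}

		/* Mark every process visited on this walk as finished. */
		cur = start;
		while (cur != NO_WAIT && state[cur] == 1)
		{
			state[cur] = 2;
			cur = waits_for[cur];
		}
	}

	return count;
}
```

With `waits_for = {1, 2, 0, 0, NO_WAIT, 6, 5}`, processes 0, 1 and 2 form one cycle and 5 and 6 form another, while process 3 only waits on a deadlocked process and is not itself reported. A single checker could then resolve both cycles in one pass instead of leaving the second one for a later timeout.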

I would like to know the community's opinion on these approaches (or maybe there are other solutions).


Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index bfa8499..6412184 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -86,6 +86,8 @@ static LOCALLOCK *lockAwaited = NULL;
 
 static DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
 
+static bool inside_deadlock_check = false;
+
 /* Is a deadlock check pending? */
 static volatile sig_atomic_t got_deadlock_timeout;
 
@@ -186,6 +188,7 @@ InitProcGlobal(void)
 	ProcGlobal->walwriterLatch = NULL;
 	ProcGlobal->checkpointerLatch = NULL;
 	pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
+	pg_atomic_init_flag(&ProcGlobal->activeDeadlockCheck);
 
 	/*
 	 * Create and initialize all the PGPROC structures we'll need.  There are
@@ -754,6 +757,14 @@ ProcReleaseLocks(bool isCommit)
 {
 	if (!MyProc)
 		return;
+
+	/* Release deadlock detection flag if backend was interrupted inside deadlock check */
+	if (inside_deadlock_check)
+	{
+		pg_atomic_clear_flag(&ProcGlobal->activeDeadlockCheck);
+		inside_deadlock_check = false;
+	}
+
 	/* If waiting, get off wait queue (should only be needed after error) */
 	LockErrorCleanup();
 	/* Release standard locks, including session-level if aborting */
@@ -1658,6 +1669,14 @@ CheckDeadLock(void)
 	int			i;
 
 	/*
+	 * Ensure that only one backend is checking for deadlock.
+	 * Otherwise, under high load, a cascade of deadlock timeout expirations can cause Postgres to get stuck.
+	 */
+	if (!pg_atomic_test_set_flag(&ProcGlobal->activeDeadlockCheck))
+		return;
+	inside_deadlock_check = true;
+
+	/*
 	 * Acquire exclusive lock on the entire shared lock data structures. Must
 	 * grab LWLocks in partition-number order to avoid LWLock deadlock.
 	 *
@@ -1732,6 +1751,9 @@ CheckDeadLock(void)
 check_done:
 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
 		LWLockRelease(LockHashPartitionLockByIndex(i));
+
+	pg_atomic_clear_flag(&ProcGlobal->activeDeadlockCheck);
+	inside_deadlock_check = false;
 }
 
 /*
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index e974f4e..f5c8b05 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -253,6 +253,8 @@ typedef struct PROC_HDR
 	int			startupProcPid;
 	/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
 	int			startupBufferPinWaitBufId;
+	/* Deadlock detection is in progress */
+	pg_atomic_flag activeDeadlockCheck;
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
