Hey Chengyu, How did you set up your GDB to get "p" to pretty-print a Postgres list?
Cheers, Jesse On Fri, Nov 30, 2018 at 1:00 PM Chengchao Yu <chen...@microsoft.com> wrote: > Greetings, > > > > Recently, we hit a few occurrences of deadlock when IO failure (including > disk full, random remote disk IO failures) happens in single user mode. We > found the issue exists on both Linux and Windows in multiple postgres > versions. > > > > Here are the steps to repro on Linux (as Windows repro is similar): > > > > 1. Get latest PostgreSQL code, build and install the executables. > > > > $ git clone https://git.postgresql.org/git/postgresql.git > > $ cd postgresql > > $ PGROOT=$(pwd) > > $ git checkout REL_11_STABLE > > $ mkdir build > > $ cd build > > $ ../configure --prefix=/path/to/postgres > > $ make && make install > > > > 2. Run initdb to initialize a PG database folder. > > > > $ /path/to/postgres/bin/initdb -D /path/to/data > > > > 3. Because the unable to write relation data scenario is difficult > to hit naturally even reserved space is turned off, I have prepared a small > patch (see attachment “emulate-error.patch”) to force an error when PG > tries to write data to relation files. We can just apply the patch and > there is no need to put efforts flooding data to disk any more. > > > > $ cd $PGROOT > > $ git apply /path/to/emulate-error.patch > > $ cd build > > $ make && make install > > > > 4. Connect to the newly initialized database cluster with single > user mode, create a table, and insert some data to the table, do a > checkpoint or directly give EOF. Then we hit the deadlock issue and the > process will not exit until we kill it. > > > > Do a checkpoint explicitly: > > > > $ /path/to/postgres/bin/postgres --single -D /path/to/data/ postgres -c > exit_on_error=true <<EOF > > > create table t1(a int); > > > insert into t1 values (1), (2), (3); > > > checkpoint; > > > EOF > > > > PostgreSQL stand-alone backend 11.1 > > backend> backend> backend> 2018-11-29 02:45:27.891 UTC [18806] FATAL: > Emulate exception in mdwrite() when writing to disk > > 2018-11-29 02:55:27.891 UTC [18806] CONTEXT: writing block 8 of relation > base/12368/1247 > > 2018-11-29 02:55:27.891 UTC [18806] STATEMENT: checkpoint; > > > > 2018-11-29 02:55:27.900 UTC [18806] FATAL: Emulate exception in mdwrite() > when writing to disk > > 2018-11-29 02:55:27.900 UTC [18806] CONTEXT: writing block 8 of relation > base/12368/1247 > > > > Or directly give an EOF: > > > > $ /path/to/postgres/bin/postgres --single -D /path/to/data/ postgres -c > exit_on_error=true <<EOF > > > create table t1(a int); > > > insert into t1 values (1), (2), (3); > > > EOF > > > > PostgreSQL stand-alone backend 11.1 > > backend> backend> backend> 2018-11-29 02:55:24.438 UTC [18149] FATAL: > Emulate exception in mdwrite() when writing to disk > > 2018-11-29 02:45:24.438 UTC [18149] CONTEXT: writing block 8 of relation > base/12368/1247 > > > > 5. Moreover, when we try to recover the database with single user > mode, we hit the issue again, and the process does not bring up the > database nor exit. > > > > $ /path/to/postgres/bin/postgres --single -D /path/to/data/ postgres -c > exit_on_error=true > > 2018-11-29 02:59:33.257 UTC [19058] LOG: database system shutdown was > interrupted; last known up at 2018-11-29 02:58:49 UTC > > 2018-11-29 02:59:33.485 UTC [19058] LOG: database system was not properly > shut down; automatic recovery in progress > > 2018-11-29 02:59:33.500 UTC [19058] LOG: redo starts at 0/1672E40 > > 2018-11-29 02:59:33.500 UTC [19058] LOG: invalid record length at > 0/1684B90: wanted 24, got 0 > > 2018-11-29 02:59:33.500 UTC [19058] LOG: redo done at 0/1684B68 > > 2018-11-29 02:59:33.500 UTC [19058] LOG: last completed transaction was > at log time 2018-11-29 02:58:49.856663+00 > > 2018-11-29 02:59:33.547 UTC [19058] FATAL: Emulate exception in mdwrite() > when writing to disk > > 2018-11-29 02:59:33.547 UTC [19058] CONTEXT: writing block 8 of relation > base/12368/1247 > > > > Analyses: > > > > So, what happened? Actually, there are 2 types of the deadlock due to the > same root cause. Let’s first take a look at the scenario in step #5. In > this scenario, the deadlock happens when disk IO failure occurs inside > StartupXLOG(). If we attach debugger to PG process, we will see the > process is stuck acquiring the buffer’s lw-lock in AbortBufferIO(). > > > > void > > AbortBufferIO(void) > > { > > BufferDesc *buf = InProgressBuf; > > > > if (buf) > > { > > uint32 buf_state; > > > > /* > > * Since LWLockReleaseAll has already been called, we're not > holding > > * the buffer's io_in_progress_lock. We have to re-acquire it so > that > > * we can use TerminateBufferIO. Anyone who's executing WaitIO on > the > > * buffer will be in a busy spin until we succeed in doing this. > > */ > > LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE); > > > > This is because the same lock has been acquired before buffer manager > attempts to flush the buffer page, which happens in StartBufferIO(). > > > > static bool > > StartBufferIO(BufferDesc *buf, bool forInput) > > { > > uint32 buf_state; > > > > Assert(!InProgressBuf); > > > > for (;;) > > { > > /* > > * Grab the io_in_progress lock so that other processes can wait > for > > * me to finish the I/O. > > */ > > LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE); > > > > buf_state = LockBufHdr(buf); > > > > if (!(buf_state & BM_IO_IN_PROGRESS)) > > break; > > > > After reading the code, AtProcExit_Buffers() assumes all the lw-locks are > released. However, in single user mode, at the time StartupXLOG() is > being executed, there is no before_shmem_exit/on_shmem_exit callback > registered to release the lw-locks. > > And, given lw-lock is non-reentrant, so the process gets stuck > re-acquiring the same lock. > > > > Here is the call stack: > > > > (gdb) bt > > #0 0x00007f0fdb7cb6d6 in futex_abstimed_wait_cancelable (private=128, > abstime=0x0, expected=0, futex_word=0x7f0fd14c81b8) at > ../sysdeps/unix/sysv/linux/futex-internal.h:205 > > #1 do_futex_wait (sem=sem@entry=0x7f0fd14c81b8, abstime=0x0) at > sem_waitcommon.c:111 > > #2 0x00007f0fdb7cb7c8 in __new_sem_wait_slow (sem=0x7f0fd14c81b8, > abstime=0x0) at sem_waitcommon.c:181 > > #3 0x00005630d475658a in PGSemaphoreLock (sema=0x7f0fd14c81b8) at > pg_sema.c:316 > > #4 0x00005630d47f689e in LWLockAcquire (lock=0x7f0fd9ae9c00, > mode=LW_EXCLUSIVE) at > /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243 > > #5 0x00005630d47cd087 in *AbortBufferIO* () at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988 > > #6 0x00005630d47cb3f9 in AtProcExit_Buffers (code=1, arg=0) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473 > > #7 0x00005630d47dbc32 in shmem_exit (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272 > > #8 0x00005630d47dba5e in proc_exit_prepare (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194 > > #9 0x00005630d47db9c6 in proc_exit (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107 > > #10 0x00005630d49811bc in *errfinish* (dummy=0) at > /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541 > > #11 0x00005630d4801f1f in mdwrite (reln=0x5630d6588c68, > forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", > skipFsync=false) at > /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843 > > #12 0x00005630d4804716 in smgrwrite (reln=0x5630d6588c68, > forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", > skipFsync=false) at > /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650 > > #13 0x00005630d47cb824 in FlushBuffer (buf=0x7f0fd19e9c00, > reln=0x5630d6588c68) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751 > > #14 0x00005630d47cb219 in SyncOneBuffer (buf_id=0, > skip_recently_used=false, wb_context=0x7ffccc371a70) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394 > > #15 0x00005630d47cab00 in BufferSync (flags=6) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984 > > #16 0x00005630d47cb57f in CheckPointBuffers (flags=6) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578 > > #17 0x00005630d44a685b in CheckPointGuts (checkPointRedo=23612304, > flags=6) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149 > > #18 0x00005630d44a62cf in CreateCheckPoint (flags=6) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937 > > #19 0x00005630d44a45e3 in *StartupXLOG* () at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:7723 > > #20 0x00005630d4995f88 in InitPostgres (in_dbname=0x5630d65582b0 > "postgres", dboid=0, username=0x5630d653d7d0 "chengyu", useroid=0, > out_dbname=0x0, override_allow_connections=false) > > at > /path/to/postgres/source/build/../src/backend/utils/init/postinit.c:636 > > #21 0x00005630d480b68b in PostgresMain (argc=7, argv=0x5630d6534d20, > dbname=0x5630d65582b0 "postgres", username=0x5630d653d7d0 "chengyu") at > /path/to/postgres/source/build/../src/backend/tcop/postgres.c:3810 > > #22 0x00005630d4695e8b in main (argc=7, argv=0x5630d6534d20) at > /path/to/postgres/source/build/../src/backend/main/main.c:224 > > > > (gdb) p on_shmem_exit_list > > $1 = {{function = 0x55801cc68f5f <AnonymousShmemDetach>, arg = 0}, > > {function = 0x55801cc68a4f <IpcMemoryDelete>, arg = 2490396}, > > {function = 0x55801cc689f8 <IpcMemoryDetach>, arg = 140602018975744}, > > {function = 0x55801cc6842e <ReleaseSemaphores>, arg = 0}, > > {function = 0x55801ccec48a <dsm_postmaster_shutdown>, arg = > 140602018975744}, > > {function = 0x55801cd04053 <ProcKill>, arg = 0}, > > {function = 0x55801cd0402d <RemoveProcFromArray>, arg = 0}, > > {function = 0x55801ccf74e8 <CleanupInvalidationState>, arg = > 140601991817088}, > > {function = 0x55801ccf446f <CleanupProcSignalState>, arg = 1}, > > {function = 0x55801ccdd3e5 <AtProcExit_Buffers>, arg = 0}, {function = > 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, > {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg > = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = > 0x0, arg = 0}, {function = 0x0, arg = 0}} > > (gdb) p before_shmem_exit_list > > $2 = {{function = 0x0, arg = 0} <repeats 20 times>} > > > > The second type is in Step #4. At the time when “checkpoint” SQL command > is being executed, PG has already set up the before_shmem_exit callback > ShutdownPostgres(), which releases all lw-locks given transaction or > sub-transaction is on-going. So after the first IO error, the buffer page’s > lw-lock gets released successfully. However, later ShutdownXLOG() is > invoked, and PG tries to flush buffer pages again, which results in the > second IO error. Different from the first time, this time, all the previous > executed before/on_shmem_exit callbacks are not invoked again due to the > decrease of the indexes. So lw-locks for buffer pages are not released when > PG tries to get the same buffer lock in AbortBufferIO(), and then PG > process gets stuck. > > > > Here is the call stack: > > > > (gdb) bt > > #0 0x00007ff0c0c036d6 in futex_abstimed_wait_cancelable (private=128, > abstime=0x0, expected=0, futex_word=0x7ff0b69001b8) at > ../sysdeps/unix/sysv/linux/futex-internal.h:205 > > #1 do_futex_wait (sem=sem@entry=0x7ff0b69001b8, abstime=0x0) at > sem_waitcommon.c:111 > > #2 0x00007ff0c0c037c8 in __new_sem_wait_slow (sem=0x7ff0b69001b8, > abstime=0x0) at sem_waitcommon.c:181 > > #3 0x0000562077cc258a in PGSemaphoreLock (sema=0x7ff0b69001b8) at > pg_sema.c:316 > > #4 0x0000562077d6289e in LWLockAcquire (lock=0x7ff0bef225c0, > mode=LW_EXCLUSIVE) at > /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243 > > #5 0x0000562077d39087 in *AbortBufferIO* () at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988 > > #6 0x0000562077d373f9 in AtProcExit_Buffers (code=1, arg=0) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473 > > #7 0x0000562077d47c32 in shmem_exit (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272 > > #8 0x0000562077d47a5e in proc_exit_prepare (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194 > > #9 0x0000562077d479c6 in proc_exit (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107 > > #10 0x0000562077eed1bc in *errfinish* (dummy=0) at > /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541 > > #11 0x0000562077d6df1f in mdwrite (reln=0x562078a12a18, > forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", > skipFsync=false) at > /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843 > > #12 0x0000562077d70716 in smgrwrite (reln=0x562078a12a18, > forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", > skipFsync=false) at > /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650 > > #13 0x0000562077d37824 in FlushBuffer (buf=0x7ff0b6e22f80, > reln=0x562078a12a18) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751 > > #14 0x0000562077d37219 in SyncOneBuffer (buf_id=78, > skip_recently_used=false, wb_context=0x7fffb0e3e230) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394 > > #15 0x0000562077d36b00 in BufferSync (flags=5) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984 > > #16 0x0000562077d3757f in CheckPointBuffers (flags=5) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578 > > #17 0x0000562077a1285b in CheckPointGuts (checkPointRedo=24049152, > flags=5) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149 > > #18 0x0000562077a122cf in CreateCheckPoint (flags=5) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937 > > #19 0x0000562077a1164f in *ShutdownXLOG* (code=1, arg=0) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8485 > > #20 0x0000562077d47c32 in shmem_exit (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272 > > #21 0x0000562077d47a5e in proc_exit_prepare (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194 > > #22 0x0000562077d479c6 in proc_exit (code=1) at > /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107 > > #23 0x0000562077eed1bc in *errfinish* (dummy=0) at > /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541 > > #24 0x0000562077d6df1f in mdwrite (reln=0x562078a12a18, > forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", > skipFsync=false) at > /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843 > > #25 0x0000562077d70716 in smgrwrite (reln=0x562078a12a18, > forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", > skipFsync=false) at > /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650 > > #26 0x0000562077d37824 in FlushBuffer (buf=0x7ff0b6e22f80, > reln=0x562078a12a18) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751 > > #27 0x0000562077d37219 in SyncOneBuffer (buf_id=78, > skip_recently_used=false, wb_context=0x7fffb0e3fb10) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394 > > #28 0x0000562077d36b00 in BufferSync (flags=44) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984 > > #29 0x0000562077d3757f in CheckPointBuffers (flags=44) at > /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578 > > #30 0x0000562077a1285b in CheckPointGuts (checkPointRedo=24049152, > flags=44) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149 > > #31 0x0000562077a122cf in CreateCheckPoint (flags=44) at > /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937 > > #32 0x0000562077cca792 in RequestCheckpoint (flags=44) at > /path/to/postgres/source/build/../src/backend/postmaster/checkpointer.c:976 > > #33 0x0000562077d7bce4 in standard_ProcessUtility (pstmt=0x562078a00b50, > queryString=0x562078a00100 "checkpoint;\n", > context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, > > dest=0x5620783ac5e0 <debugtupDR>, completionTag=0x7fffb0e41520 "") at > /path/to/postgres/source/build/../src/backend/tcop/utility.c:769 > > #34 0x0000562077d7b204 in ProcessUtility (pstmt=0x562078a00b50, > queryString=0x562078a00100 "checkpoint;\n", > context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, > dest=0x5620783ac5e0 <debugtupDR>, > > completionTag=0x7fffb0e41520 "") at > /path/to/postgres/source/build/../src/backend/tcop/utility.c:360 > > #35 0x0000562077d7a347 in PortalRunUtility (portal=0x5620789f20c0, > pstmt=0x562078a00b50, isTopLevel=true, setHoldSnapshot=false, > dest=0x5620783ac5e0 <debugtupDR>, completionTag=0x7fffb0e41520 "") > > at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:1178 > > #36 0x0000562077d7a534 in PortalRunMulti (portal=0x5620789f20c0, > isTopLevel=true, setHoldSnapshot=false, dest=0x5620783ac5e0 <debugtupDR>, > altdest=0x5620783ac5e0 <debugtupDR>, > > completionTag=0x7fffb0e41520 "") at > /path/to/postgres/source/build/../src/backend/tcop/pquery.c:1324 > > #37 0x0000562077d79a61 in PortalRun (portal=0x5620789f20c0, > count=9223372036854775807, isTopLevel=true, run_once=true, > dest=0x5620783ac5e0 <debugtupDR>, altdest=0x5620783ac5e0 <debugtupDR>, > > completionTag=0x7fffb0e41520 "") at > /path/to/postgres/source/build/../src/backend/tcop/pquery.c:799 > > #38 0x0000562077d734c5 in exec_simple_query (query_string=0x562078a00100 > "checkpoint;\n") at > /path/to/postgres/source/build/../src/backend/tcop/postgres.c:1145 > > #39 0x0000562077d77bd5 in PostgresMain (argc=7, argv=0x562078980d20, > dbname=0x5620789a42b0 "postgres", username=0x5620789897d0 "chengyu") at > /path/to/postgres/source/build/../src/backend/tcop/postgres.c:4182 > > #40 0x0000562077c01e8b in main (argc=7, argv=0x562078980d20) at > /path/to/postgres/source/build/../src/backend/main/main.c:224 > > > > (gdb) p on_shmem_exit_list > > $9 = {{function = 0x562077cc2f5f <AnonymousShmemDetach>, arg = 0}, > > {function = 0x562077cc2a4f <IpcMemoryDelete>, arg = 2457627}, > > {function = 0x562077cc29f8 <IpcMemoryDetach>, arg = 140672005165056}, > > {function = 0x562077cc242e <ReleaseSemaphores>, arg = 0}, > > {function = 0x562077d4648a <dsm_postmaster_shutdown>, arg = > 140672005165056}, > > {function = 0x562077d5e053 <ProcKill>, arg = 0}, > > {function = 0x562077d5e02d <RemoveProcFromArray>, arg = 0}, > > {function = 0x562077d514e8 <CleanupInvalidationState>, arg = > 140671978006400}, > > {function = 0x562077d4e46f <CleanupProcSignalState>, arg = 1}, > > {function = 0x562077d373e5 <AtProcExit_Buffers>, arg = 0}, > > {function = 0x562077a1159d <ShutdownXLOG>, arg = 0}, > > {function = 0x562077cd0637 <pgstat_beshutdown_hook>, arg = 0}, {function = > 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, > {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg > = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}} > > (gdb) p before_shmem_exit_list > > $10 = {{function = 0x562077f02caa <ShutdownPostgres>, arg = 0}, {function > = 0x0, arg = 0} <repeats 19 times>} > > > > OK, now we understand the deadlock issue for single user mode. However, > will this issue affect multi-user mode (i.e. under postmaster process)? We > can have 3 cases for discussion: > > > > 1. Startup process: at the time StartupXLOG() is invoked, > ShutdownAuxiliaryProcess(), which will release all the lw-locks, has > been already registered in before_shmem_exit_list[]. So this case is > safe. > 2. Checkpointer process: ShutdownXLOG() is not registered as a > before/on_shmem_exit callback, instead, it’s only invoked in the main loop. > So there is no chance to hit IO error for second time during shared memory > exit callbacks. Also, Same as startup process, > ShutdownAuxiliaryProcess() has been registered. So this case is also > safe. > 3. Other backend/background processes: these processes do not handle > XLOG startup or shutdown, and are protected by > ShutdownAuxiliaryProcess(). So they are safe to exit too. > > > > In addition, we have done multiple experiments to confirm these cases. > > > > Affected versions: we found this issue in 9.5, 9.6, 10, 11 and 12devel. > > > > Fix proposal: > > > > According to the affected 2 types of deadlock in single user mode > discussed above, there might be multiple ways to fix this issue. In the fix > proposal we would like to present, we register a new callback to release > all the lw-locks (just like what ShutdownAuxiliaryProcess()does) in an > order after AtProcExit_Buffers() and before ShutdownXLOG(). Also, it is > registered before PG enters StartupXLOG(), so it can cover the case when > ShutdownPostgres() has not been registered. Here is the content of the > proposal: > > > > diff --git a/src/backend/utils/init/postinit.c > b/src/backend/utils/init/postinit.c > > index 62baaf0ab3..d74e8aa1d5 100644 > > --- a/src/backend/utils/init/postinit.c > > +++ b/src/backend/utils/init/postinit.c > > @@ -71,6 +71,7 @@ static HeapTuple GetDatabaseTupleByOid(Oid dboid); > > static void PerformAuthentication(Port *port); > > static void CheckMyDatabase(const char *name, bool am_superuser, bool > override_allow_connections); > > static void InitCommunication(void); > > +static void ReleaseLWLocks(int code, Datum arg); > > static void ShutdownPostgres(int code, Datum arg); > > static void StatementTimeoutHandler(void); > > static void LockTimeoutHandler(void); > > @@ -653,6 +654,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const > char *username, > > * way, start up the XLOG machinery, and register to have > it closed > > * down at exit. > > */ > > + on_shmem_exit(ReleaseLWLocks, 0); > > StartupXLOG(); > > on_shmem_exit(ShutdownXLOG, 0); > > } > > @@ -1214,6 +1216,23 @@ process_settings(Oid databaseid, Oid roleid) > > heap_close(relsetting, AccessShareLock); > > } > > +/* > > + * There are 2 types of buffer locks on-holding when AtProcExit_Buffers() > is > > + * invoked in a bootstrap process or a standalone backend: > > + * (1) Exceptions thrown during StartupXLOG() > > + * (2) Exceptions thrown during exception-handling in ShutdownXLOG() > > + * So we need this on_shmem_exit callback for single user mode. > > + * For processes under postmaster, ShutdownAuxiliaryProcess() will release > > + * the lw-locks and ShutdownXLOG() is not registered as a callback, so > there > > + * is no such issue. Also, please note this callback should be registered > in > > + * the order after AtProcExit_buffers() and before ShutdownXLOG(). > > + */ > > +static void > > +ReleaseLWLocks(int code, Datum arg) > > +{ > > + LWLockReleaseAll(); > > +} > > + > > /* > > * Backend-shutdown callback. Do cleanup that we want to be sure happens > > * before all the supporting modules begin to nail their doors shut via > > > > The fix proposal is also attached to this email in file > “fix-deadlock.patch”. > > > > Please let us know should you have suggestions on this issue and the fix. > > > > Thank you! > > > > Best regards, > > -- > > Chengchao Yu > > Software Engineer | Microsoft | Azure Database for PostgreSQL > > https://azure.microsoft.com/en-us/services/postgresql/ > > >