On Sat, Feb 26, 2022 at 2:07 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> About once a month over the last six months, my buildfarm animal
> florican has gotten stuck while running the core regression tests.
> The symptoms have looked very much the same each time: there is
> a backend with two parallel worker processes that are just sitting
> and not consuming any CPU time.  Each time I've attached to these
> processes with gdb to check their stack traces, and seen pretty
> much the same story every time (traces below).  What is really
> interesting is that after I detach from the second worker, the
> processes resume running and finish out the test successfully!
> I don't know much about how gdb interacts with kernel calls on
> FreeBSD, but I speculate that the poll(2) call returns with EINTR
> after gdb releases the process, and then things resume fine,
> suggesting that we lost an interrupt somewhere.
>
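That would fit the usual shape of a poll()-based wait loop: the wake
condition gets re-checked every time poll() returns, EINTR included, so
anything that interrupts the syscall (a debugger detaching, for instance)
gives a waiter that missed its wakeup another chance to notice it.  Here is
a minimal standalone sketch of that pattern (illustrative only, not the
actual latch.c code; self_pipe and wake_flag are made-up names):

#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

static int self_pipe[2];        /* written to by the waker */
static volatile bool wake_flag; /* the condition we are waiting for */

static void
wait_for_wakeup(void)
{
    struct pollfd pfd = {.fd = self_pipe[0], .events = POLLIN};

    while (!wake_flag)
    {
        int     rc = poll(&pfd, 1, -1);     /* block indefinitely */

        if (rc < 0 && errno != EINTR)
            break;                          /* real error: give up */

        if (rc > 0 && (pfd.revents & POLLIN))
        {
            char    buf[16];
            ssize_t n = read(self_pipe[0], buf, sizeof(buf));

            (void) n;                       /* just draining the pipe */
        }
        /* wake_flag is re-checked here even if poll() returned on EINTR */
    }
}

If the byte that was supposed to land in the pipe never arrives, this loop
sleeps forever; an EINTR from outside is enough to force the re-check and
get things moving again.
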
> I have observed this three times in the REL_11 branch, once
> in REL_12, and a couple of times last summer before it occurred
> to me to start keeping notes.  Over that time the machine has
> been running various patchlevels of FreeBSD 13.0.
>
> Here's the stack trace from the leader process in the most
> recent event (REL_11 as of yesterday).  It's not always the
> same query that gets stuck, but it's always a parallel hash join:
>
> (gdb) p debug_query_string
> $1 = 0x21873090 "select count(*) from simple r join simple s using (id);"
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x219dc170, nfds=2, timeout=-1) at 
> /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x219dc170, nfds=2, timeout=-1) at 
> /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, 
> occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171
> #4  WaitEventSetWait (set=0x219dc138, timeout=-1, occurred_events=<optimized 
> out>, nevents=1, wait_event_info=134217738) at latch.c:1000
> #5  0x0099842b in ConditionVariableSleep (cv=0x2bf0b230, 
> wait_event_info=134217738) at condition_variable.c:157
> #6  0x00977e12 in BarrierArriveAndWait (barrier=0x2bf0b218, 
> wait_event_info=134217738) at barrier.c:191
> #7  0x00873441 in ExecHashJoinImpl (pstate=<optimized out>, parallel=true) at 
> nodeHashjoin.c:328
> #8  ExecParallelHashJoin (pstate=0x2a887740) at nodeHashjoin.c:600
> #9  0x0086a509 in ExecProcNode (node=0x2a887740) at 
> ../../../src/include/executor/executor.h:248
> #10 fetch_input_tuple (aggstate=aggstate@entry=0x2a887500) at nodeAgg.c:407
> #11 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at 
> nodeAgg.c:1746
> #12 ExecAgg (pstate=0x2a887500) at nodeAgg.c:1561
> #13 0x0086e181 in ExecProcNode (node=0x2a887500) at 
> ../../../src/include/executor/executor.h:248
> #14 gather_getnext (gatherstate=0x2a887414) at nodeGather.c:276
> #15 ExecGather (pstate=0x2a887414) at nodeGather.c:207
> #16 0x0086a509 in ExecProcNode (node=0x2a887414) at 
> ../../../src/include/executor/executor.h:248
> #17 fetch_input_tuple (aggstate=aggstate@entry=0x2a8871b8) at nodeAgg.c:407
> #18 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at 
> nodeAgg.c:1746
> #19 ExecAgg (pstate=0x2a8871b8) at nodeAgg.c:1561
> #20 0x0085956b in ExecProcNode (node=0x2a8871b8) at 
> ../../../src/include/executor/executor.h:248
> #21 ExecutePlan (estate=0x2a887090, planstate=0x2a8871b8, 
> use_parallel_mode=<optimized out>, operation=CMD_SELECT, 
> sendTuples=<optimized out>,
>     numberTuples=<optimized out>, direction=<optimized out>, dest=0x2ab629b4, 
> execute_once=<optimized out>) at execMain.c:1712
> #22 standard_ExecutorRun (queryDesc=0x218a1490, 
> direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at 
> execMain.c:353
> #23 0x00859445 in ExecutorRun (queryDesc=0x218a1490, 
> direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at 
> execMain.c:296
> #24 0x009a3d57 in PortalRunSelect (portal=portal@entry=0x2197f090, 
> forward=<optimized out>, count=0, dest=0x2ab629b4) at pquery.c:941
> #25 0x009a39d6 in PortalRun (portal=0x2197f090, count=2147483647, 
> isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x2ab629b4, 
> altdest=0x2ab629b4,
>     completionTag=0xffbfdc70 "") at pquery.c:782
> #26 0x009a2b53 in exec_simple_query 
> (query_string=query_string@entry=0x21873090 "select count(*) from simple r 
> join simple s using (id);") at postgres.c:1145
> #27 0x009a04ec in PostgresMain (argc=1, argv=0x21954ed4, dbname=<optimized 
> out>, username=0x21954d38 "buildfarm") at postgres.c:4052
>
> Worker 1 looks like this:
>
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x2187cb48, nfds=2, timeout=-1) at 
> /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x2187cb48, nfds=2, timeout=-1) at 
> /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, 
> occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171
> #4  WaitEventSetWait (set=0x2187cb10, timeout=-1, occurred_events=<optimized 
> out>, nevents=1, wait_event_info=134217738) at latch.c:1000
> #5  0x0099842b in ConditionVariableSleep (cv=0x2196b230, 
> wait_event_info=134217738) at condition_variable.c:157
> #6  0x00977e12 in BarrierArriveAndWait (barrier=0x2196b218, 
> wait_event_info=134217738) at barrier.c:191
> #7  0x00873441 in ExecHashJoinImpl (pstate=<optimized out>, parallel=true) at 
> nodeHashjoin.c:328
> #8  ExecParallelHashJoin (pstate=0x2a8378d8) at nodeHashjoin.c:600
> #9  0x0086a509 in ExecProcNode (node=0x2a8378d8) at 
> ../../../src/include/executor/executor.h:248
> #10 fetch_input_tuple (aggstate=aggstate@entry=0x2a8376c8) at nodeAgg.c:407
> #11 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at 
> nodeAgg.c:1746
> #12 ExecAgg (pstate=0x2a8376c8) at nodeAgg.c:1561
> #13 0x0085956b in ExecProcNode (node=0x2a8376c8) at 
> ../../../src/include/executor/executor.h:248
> #14 ExecutePlan (estate=0x2a837090, planstate=0x2a8376c8, 
> use_parallel_mode=<optimized out>, operation=CMD_SELECT, 
> sendTuples=<optimized out>,
>     numberTuples=<optimized out>, direction=<optimized out>, dest=0x219ac780, 
> execute_once=<optimized out>) at execMain.c:1712
> #15 standard_ExecutorRun (queryDesc=0x219ad3e0, 
> direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at 
> execMain.c:353
> #16 0x00859445 in ExecutorRun (queryDesc=0x219ad3e0, 
> direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at 
> execMain.c:296
> #17 0x0085d6de in ParallelQueryMain (seg=0x2187c89c, toc=0x2196b000) at 
> execParallel.c:1402
> #18 0x00750ee0 in ParallelWorkerMain (main_arg=1903254317) at parallel.c:1433
>
> Worker 2 looks like this:
>
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x2187cb48, nfds=2, timeout=-1) at 
> /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x2187cb48, nfds=2, timeout=-1) at 
> /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, 
> occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171
> #4  WaitEventSetWait (set=0x2187cb10, timeout=-1, occurred_events=<optimized 
> out>, nevents=1, wait_event_info=134217737) at latch.c:1000
> #5  0x0099842b in ConditionVariableSleep (cv=0x2196b230, 
> wait_event_info=134217737) at condition_variable.c:157
> #6  0x00977e12 in BarrierArriveAndWait (barrier=0x2196b218, 
> wait_event_info=134217737) at barrier.c:191
> #7  0x0086ea44 in MultiExecParallelHash (node=<optimized out>) at 
> nodeHash.c:312
> #8  MultiExecHash (node=0x2a837e88) at nodeHash.c:112
> #9  0x0085fa03 in MultiExecProcNode (node=0x2a837e88) at execProcnode.c:501
> #10 0x00872f80 in ExecHashJoinImpl (pstate=<optimized out>, parallel=true) at 
> nodeHashjoin.c:290
> #11 ExecParallelHashJoin (pstate=0x2a8378d8) at nodeHashjoin.c:600
> #12 0x0086a509 in ExecProcNode (node=0x2a8378d8) at 
> ../../../src/include/executor/executor.h:248
> #13 fetch_input_tuple (aggstate=aggstate@entry=0x2a8376c8) at nodeAgg.c:407
> #14 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at 
> nodeAgg.c:1746
> #15 ExecAgg (pstate=0x2a8376c8) at nodeAgg.c:1561
> #16 0x0085956b in ExecProcNode (node=0x2a8376c8) at 
> ../../../src/include/executor/executor.h:248
> #17 ExecutePlan (estate=0x2a837090, planstate=0x2a8376c8, 
> use_parallel_mode=<optimized out>, operation=CMD_SELECT, 
> sendTuples=<optimized out>,
>     numberTuples=<optimized out>, direction=<optimized out>, dest=0x219ac780, 
> execute_once=<optimized out>) at execMain.c:1712
> #18 standard_ExecutorRun (queryDesc=0x219ad3e0, 
> direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at 
> execMain.c:353
> #19 0x00859445 in ExecutorRun (queryDesc=0x219ad3e0, 
> direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at 
> execMain.c:296
> #20 0x0085d6de in ParallelQueryMain (seg=0x2187c89c, toc=0x2196b000) at 
> execParallel.c:1402
> #21 0x00750ee0 in ParallelWorkerMain (main_arg=1903254317) at parallel.c:1433
> #22 0x009193d3 in StartBackgroundWorker () at bgworker.c:911
>
> In each case I attached to the process, ran the gdb commands shown
> above, and detached.  As I've grown used to seeing, the processes
> resumed running after I detached from the second worker.  I've not
> tried attaching in something other than PID order, but I probably
> will next time.
>
> Thoughts?  Ideas on debugging this?

How could worker 2 be waiting on the build barrier in
PHJ_BUILD_HASHING_INNER while worker 1 and the leader are waiting on the
same barrier, supposedly already in PHJ_BUILD_HASHING_OUTER?  Worker 2's
trace is still inside MultiExecParallelHash (hashing the inner relation),
whereas the other two are parked at the post-build wait in ExecHashJoinImpl,
and the wait_event_info values differ (134217737 vs. 134217738), which is
what points at two different phases of the same barrier.
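
For context, the arrive-and-wait pattern in these traces behaves like the
classic phase barrier sketched below (a simplified standalone sketch, not
PostgreSQL's barrier.c; SketchBarrier and its field names are invented).
Each participant that arrives sleeps until the phase it arrived in has
ended, and the last one to arrive advances the phase and broadcasts:

#include <pthread.h>
#include <stdbool.h>

typedef struct SketchBarrier
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             participants;  /* number of attached participants */
    int             arrived;       /* arrivals in the current phase */
    int             phase;         /* current phase number */
} SketchBarrier;

/* Returns true in the one participant elected to advance the phase. */
static bool
sketch_barrier_arrive_and_wait(SketchBarrier *b)
{
    bool    elected = false;

    pthread_mutex_lock(&b->mutex);
    if (++b->arrived == b->participants)
    {
        /* Last to arrive: end this phase and wake the other waiters. */
        b->arrived = 0;
        b->phase++;
        elected = true;
        pthread_cond_broadcast(&b->cond);
    }
    else
    {
        /* Sleep until the phase we arrived in is over. */
        int     start_phase = b->phase;

        while (b->phase == start_phase)
            pthread_cond_wait(&b->cond, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
    return elected;
}

Under that model a process should only ever be seen waiting for the phase it
arrived in to finish, so two processes apparently parked on the same barrier
but in different build phases smells like a lost wakeup (a phase change
whose notification never reached worker 2) rather than anything the code is
supposed to do.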

