On 2/7/23 01:09, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>> On 2/7/23 00:48, Thomas Munro wrote:
>>> On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
>>> <tomas.von...@enterprisedb.com> wrote:
>>>> No, I left the workload as it was for the first lockup, so `make check`
>>>> runs everything as is up until the "join" test suite.
>>>
>>> Wait, shouldn't that be join_hash?
>>
>> No, because join_hash does not exist on 11 (it was added in 12). Also,
>> it actually locked up like this - that's the lockup I reported on 28/1.
> 
> Oh, good.  I had been trying to repro with 12 here and forgot that you
> were looking at 11...

FYI it happened again, on a regular run of regression tests (I gave up
on trying to reproduce this - after some initial hits I didn't hit it in
a couple thousand tries so I just added the machine back to buildfarm).

Anyway, same symptoms - lockup in join_hash on PG11, leader waiting on
WaitLatch and both workers waiting on BarrierArriveAndWait. I forgot
running gdb on the second worker will get it unstuck, so I haven't been
able to collect more info.

What else do you think would be useful to collect next time?

It's hard to draw conclusions due to the low probability of the issue,
but it's pretty weird this only ever happened on 11 so far.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
 1134  -  ICs       0:04.43 dhclient: genet0 (dhclient)
 1135  -  Is        0:05.10 /sbin/devd
 1333  -  Is        0:30.12 /usr/sbin/syslogd -s
 1433  -  Is        0:13.19 sshd: /usr/sbin/sshd [listener] 0 of 10-100 
startups (sshd)
 1437  -  Is        0:58.05 /usr/sbin/cron -s
54451  -  Is        0:00.07 screen (screen-4.9.0)
59248  -  Ss        9:52.76 /usr/sbin/ntpd -p /var/db/ntp/ntpd.pid -c 
/etc/ntp.conf -f /var/db/ntp/ntpd.drift
73926  -  I         0:00.01 cron: running job (cron)
73927  -  Is        0:00.04 /usr/local/bin/bash ./run.sh
73929  -  I         0:01.98 perl ./run_branches.pl --run-all --verbose
73931  -  I         0:01.58 /usr/local/bin/perl ./run_build.pl --verbose 
--config ./build-farm.conf REL_11_STABLE
77639  -  I         0:00.02 sh -c { cd pgsql.build/src/test/regress && gmake 
NO_LOCALE=1 check; echo $? > 
/mnt/data/buildfarm/buildroot/REL_11_STABLE/dikkop.lastrun-logs/laststatus; } > 
/mnt/data/buildfarm/buildroot/REL_11_STABLE/dikkop.
77640  -  I         0:00.04 gmake NO_LOCALE=1 check
78893  -  I         0:00.33 ../../../src/test/regress/pg_regress 
--temp-instance=./tmp_check --inputdir=. --bindir= 
--temp-config=/tmp/buildfarm-VLvzXQ/bfextra.conf --no-locale --port=5718 
--dlpath=. --max-concurrent-tests=20 --port=5718
78902  -  I         1:18.47 postgres -D 
/mnt/data/buildfarm/buildroot/REL_11_STABLE/pgsql.build/src/test/regress/./tmp_check/data
 -F -c listen_addresses= -k /tmp/pg_regress-UyjmKQ
78905  -  Is        0:01.14 postgres: checkpointer    (postgres)
78906  -  Ss        0:02.18 postgres: background writer    (postgres)
78907  -  Ss        0:02.57 postgres: walwriter    (postgres)
78908  -  Is        0:10.60 postgres: autovacuum launcher    (postgres)
78909  -  Is        0:41.88 postgres: stats collector    (postgres)
78910  -  Is        0:00.31 postgres: logical replication launcher    (postgres)
79193  -  I         0:00.41 psql -X -a -q -d regression
79215  -  Is        0:07.66 postgres: bsd regression [local] SELECT (postgres)
79301  -  Is        0:00.08 postgres: parallel worker for PID 79215    
(postgres)
79302  -  Is        0:00.05 postgres: parallel worker for PID 79215    
(postgres)
93662  -  Is        0:00.14 sshd: bsd [priv] (sshd)
93664  -  S         0:00.06 sshd: bsd@pts/0 (sshd)
 1459 u0  Is+       0:00.27 /usr/libexec/getty 3wire ttyu0
 1451 v0  Is        0:00.05 login [pam] (login)
 1460 v0  I+        0:00.08 -sh (sh)


[Switching to LWP 100121 of process 79215]
_poll () at _poll.S:4
4       _poll.S: No such file or directory.
(gdb) bt
#0  _poll () at _poll.S:4
#1  0x00000000884ccc60 in __thr_poll (fds=0x4, nfds=1, timeout=-1) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:338
#2  0x0000000000804e80 in WaitEventSetWaitBlock (set=0x9c945be0, 
cur_timeout=-1, occurred_events=0x814f4a28, nevents=1) at latch.c:1171
#3  WaitEventSetWait (set=set@entry=0x9c945be0, timeout=<optimized out>, 
timeout@entry=-1, occurred_events=<optimized out>, 
occurred_events@entry=0x814f4a28, nevents=nevents@entry=1, 
wait_event_info=<optimized out>, 
    wait_event_info@entry=134217731) at latch.c:1000
#4  0x0000000000804af0 in WaitLatchOrSocket (latch=0x9bc30d18, 
wakeEvents=wakeEvents@entry=1, sock=-1, timeout=-1, wait_event_info=134217731) 
at latch.c:385
#5  0x00000000008049dc in WaitLatch (latch=0x4, wakeEvents=wakeEvents@entry=1, 
timeout=2286734372, wait_event_info=134217731) at latch.c:339
#6  0x00000000006e3614 in gather_readnext (gatherstate=<optimized out>) at 
nodeGather.c:367
#7  gather_getnext (gatherstate=0x912f5e98) at nodeGather.c:256
#8  ExecGather (pstate=0x912f5e98) at nodeGather.c:207
#9  0x00000000006f2da0 in ExecProcNode (node=0x912f5e98) at 
../../../src/include/executor/executor.h:248
#10 ExecNestLoop (pstate=0x912f5818) at nodeNestloop.c:160
#11 0x00000000006df71c in ExecProcNode (node=0x912f5818) at 
../../../src/include/executor/executor.h:248
#12 fetch_input_tuple (aggstate=aggstate@entry=0x912f5368) at nodeAgg.c:407
#13 0x00000000006de968 in agg_retrieve_direct (aggstate=0x912f5368) at 
nodeAgg.c:1843
#14 ExecAgg (pstate=0x912f5368) at nodeAgg.c:1561
#15 0x00000000006cd544 in ExecProcNode (node=0x912f5368) at 
../../../src/include/executor/executor.h:248
#16 ExecutePlan (estate=0x912f5118, use_parallel_mode=<optimized out>, 
operation=CMD_SELECT, numberTuples=0, direction=<optimized out>, 
dest=0x9cda6c58, planstate=<optimized out>, sendTuples=<optimized out>, 
execute_once=<optimized out>)
    at execMain.c:1712
#17 standard_ExecutorRun (queryDesc=0x911c4118, direction=<optimized out>, 
count=0, execute_once=<optimized out>) at execMain.c:353
#18 0x00000000008304dc in PortalRunSelect (portal=portal@entry=0x9127b118, 
forward=<optimized out>, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x9cda6c58) at pquery.c:941
#19 0x00000000008300f0 in PortalRun (portal=portal@entry=0x9127b118, 
count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, 
dest=dest@entry=0x9cda6c58, altdest=<optimized out>, altdest@entry=0x9cda6c58, 
    completionTag=completionTag@entry=0x814f4fe0 "") at pquery.c:782
#20 0x000000000082f110 in exec_simple_query (
    query_string=query_string@entry=0x91189118 "select count(*) from join_foo\n 
 left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) 
ss\n  on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;")
    at postgres.c:1144
#21 0x000000000082cf84 in PostgresMain (argc=<optimized out>, argv=<optimized 
out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:4225
#22 0x00000000007abbd8 in BackendRun (port=port@entry=0x91254000) at 
postmaster.c:4429
#23 0x00000000007ab20c in BackendStartup (port=0x91254000) at postmaster.c:4093
#24 ServerLoop () at postmaster.c:1728
#25 0x00000000007a879c in PostmasterMain (argc=<optimized out>, argc@entry=8, 
argv=<optimized out>, argv@entry=0x814f5f90) at postmaster.c:1401
#26 0x0000000000718054 in main (argc=8, argv=0x814f5f90) at main.c:228
(gdb) 



[Switching to LWP 100109 of process 79301]
_poll () at _poll.S:4
4       _poll.S: No such file or directory.
(gdb) bt
#0  _poll () at _poll.S:4
#1  0x00000000884ccc60 in __thr_poll (fds=0x4, nfds=2, timeout=-1) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:338
#2  0x0000000000804e80 in WaitEventSetWaitBlock (set=0x9125d6c8, 
cur_timeout=-1, occurred_events=0x814f4628, nevents=1) at latch.c:1171
#3  WaitEventSetWait (set=0x9125d6c8, timeout=<optimized out>, 
occurred_events=<optimized out>, occurred_events@entry=0x814f4628, 
nevents=nevents@entry=1, wait_event_info=<optimized out>, 
wait_event_info@entry=134217738) at latch.c:1000
#4  0x00000000008241fc in ConditionVariableSleep (cv=cv@entry=0x9f4aa2a8, 
wait_event_info=wait_event_info@entry=134217738) at condition_variable.c:157
#5  0x00000000008018d8 in BarrierArriveAndWait 
(barrier=barrier@entry=0x9f4aa290, wait_event_info=134217738) at barrier.c:191
#6  0x00000000006e9010 in ExecHashJoinImpl (pstate=0x9d3fab70, parallel=true) 
at nodeHashjoin.c:349
#7  ExecParallelHashJoin (pstate=0x9d3fab70) at nodeHashjoin.c:630
#8  0x00000000006cd544 in ExecProcNode (node=0x9d3fab70) at 
../../../src/include/executor/executor.h:248
#9  ExecutePlan (estate=0x9d3fa118, use_parallel_mode=<optimized out>, 
operation=CMD_SELECT, numberTuples=0, direction=<optimized out>, 
dest=0x9127ca78, planstate=<optimized out>, sendTuples=<optimized out>, 
execute_once=<optimized out>)
    at execMain.c:1712
#10 standard_ExecutorRun (queryDesc=0x913685a0, direction=<optimized out>, 
count=0, execute_once=<optimized out>) at execMain.c:353
#11 0x00000000006d19d0 in ParallelQueryMain (seg=seg@entry=0x91192fb8, 
toc=toc@entry=0x9f4aa000) at execParallel.c:1402
#12 0x00000000005b482c in ParallelWorkerMain (main_arg=<optimized out>) at 
parallel.c:1433
#13 0x000000000079dfac in StartBackgroundWorker () at bgworker.c:911
#14 0x00000000007aa784 in do_start_bgworker (rw=0x911b7c80) at postmaster.c:5823
#15 maybe_start_bgworkers () at postmaster.c:6048
#16 0x00000000007a90c0 in sigusr1_handler (postgres_signal_arg=<optimized out>) 
at postmaster.c:5161
#17 0x00000000884ca794 in handle_signal (actp=actp@entry=0x814f5020, 
sig=sig@entry=30, info=info@entry=0x814f5090, ucp=ucp@entry=0x814f50e0) at 
/usr/src/lib/libthr/thread/thr_sig.c:303
#18 0x00000000884c9ed4 in thr_sighandler (sig=30, info=0x814f5090, 
_ucp=0x814f50e0) at /usr/src/lib/libthr/thread/thr_sig.c:246
#19 <signal handler called>
#20 _select () at _select.S:4
#21 0x00000000884ccf1c in __thr_select (numfds=4, readfds=0x814f5500, 
writefds=0x0, exceptfds=0x0, timeout=0x814f5610) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:495
#22 0x00000000007aacbc in ServerLoop () at postmaster.c:1692
#23 0x00000000007a879c in PostmasterMain (argc=<optimized out>, argc@entry=8, 
argv=<optimized out>, argv@entry=0x814f5f90) at postmaster.c:1401
#24 0x0000000000718054 in main (argc=8, argv=0x814f5f90) at main.c:228
(gdb) q
A debugging session is active.

        Inferior 1 [process 79301] will be detached.

Quit anyway? (y or n) y



_poll () at _poll.S:4
4       _poll.S: No such file or directory.
(gdb) bt
#0  _poll () at _poll.S:4
#1  0x00000000884ccc60 in __thr_poll (fds=0x4, nfds=2, timeout=-1) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:338
#2  0x0000000000804e80 in WaitEventSetWaitBlock (set=0x9125d6c8, 
cur_timeout=-1, occurred_events=0x814f4578, nevents=1) at latch.c:1171
#3  WaitEventSetWait (set=0x9125d6c8, timeout=<optimized out>, 
occurred_events=<optimized out>, occurred_events@entry=0x814f4578, 
nevents=nevents@entry=1, wait_event_info=<optimized out>, 
wait_event_info@entry=134217737) at latch.c:1000
#4  0x00000000008241fc in ConditionVariableSleep (cv=cv@entry=0x9f4aa2a8, 
wait_event_info=wait_event_info@entry=134217737) at condition_variable.c:157
#5  0x00000000008018d8 in BarrierArriveAndWait 
(barrier=barrier@entry=0x9f4aa290, wait_event_info=134217737) at barrier.c:191
#6  0x00000000006e3b98 in MultiExecParallelHash (node=0x9c8ae908) at 
nodeHash.c:312
#7  MultiExecHash (node=node@entry=0x9c8aeb70) at nodeHash.c:112
#8  0x00000000006d42b8 in MultiExecProcNode (node=node@entry=0x9c8ae908) at 
execProcnode.c:502
#9  0x00000000006e8b50 in ExecHashJoinImpl (pstate=0x9c8aeb70, parallel=true) 
at nodeHashjoin.c:296
#10 ExecParallelHashJoin (pstate=0x9c8aeb70) at nodeHashjoin.c:630
#11 0x00000000006cd544 in ExecProcNode (node=0x9c8aeb70) at 
../../../src/include/executor/executor.h:248
#12 ExecutePlan (estate=0x9c8ae118, use_parallel_mode=<optimized out>, 
operation=CMD_SELECT, numberTuples=0, direction=<optimized out>, 
dest=0x9127ca78, planstate=<optimized out>, sendTuples=<optimized out>, 
execute_once=<optimized out>)
    at execMain.c:1712
#13 standard_ExecutorRun (queryDesc=0x913685a0, direction=<optimized out>, 
count=0, execute_once=<optimized out>) at execMain.c:353
#14 0x00000000006d19d0 in ParallelQueryMain (seg=seg@entry=0x91192fb8, 
toc=toc@entry=0x9f4aa000) at execParallel.c:1402
#15 0x00000000005b482c in ParallelWorkerMain (main_arg=<optimized out>) at 
parallel.c:1433
#16 0x000000000079dfac in StartBackgroundWorker () at bgworker.c:911
#17 0x00000000007aa784 in do_start_bgworker (rw=0x911b7780) at postmaster.c:5823
#18 maybe_start_bgworkers () at postmaster.c:6048
#19 0x00000000007a90c0 in sigusr1_handler (postgres_signal_arg=<optimized out>) 
at postmaster.c:5161
#20 0x00000000884ca794 in handle_signal (actp=actp@entry=0x814f5020, 
sig=sig@entry=30, info=info@entry=0x814f5090, ucp=ucp@entry=0x814f50e0) at 
/usr/src/lib/libthr/thread/thr_sig.c:303
#21 0x00000000884c9ed4 in thr_sighandler (sig=30, info=0x814f5090, 
_ucp=0x814f50e0) at /usr/src/lib/libthr/thread/thr_sig.c:246
#22 <signal handler called>
#23 _select () at _select.S:4
#24 0x00000000884ccf1c in __thr_select (numfds=4, readfds=0x814f5500, 
writefds=0x0, exceptfds=0x0, timeout=0x814f5610) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:495
#25 0x00000000007aacbc in ServerLoop () at postmaster.c:1692
#26 0x00000000007a879c in PostmasterMain (argc=<optimized out>, argc@entry=8, 
argv=<optimized out>, argv@entry=0x814f5f90) at postmaster.c:1401
#27 0x0000000000718054 in main (argc=8, argv=0x814f5f90) at main.c:228
(gdb) 

Reply via email to