Thank you very much for testing this! At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df...@lab.ntt.co.jp> > Horiguchi-san, > > On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote: > > I noticed that this patch is conflicting with 665d1fa (Logical > > replication) so I rebased this. Only executor/Makefile > > conflicted. > > With the latest set of patches, I observe a crash due to an Assert failure: > > #3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8, > timeout=-1) at execAsync.c:345
This means no pending fdw scan didn't let itself go to waiting stage. It leads to a stuck of the whole things. This is caused if no one acutually is waiting for result. I suppose that all of the foreign scans ran on the same connection. Anyway it should be a mistake in state transition. I'll look into it. > I was running a query whose plan looked like: > > explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab > group by 1,2 order by 1; > QUERY PLAN > ------------------------------------------------------ > Sort > Sort Key: ((ptab.tableoid)::regclass) > -> HashAggregate > Group Key: (ptab.tableoid)::regclass, ptab.a > -> Result > -> Append > -> Foreign Scan on ptab_00001 > -> Foreign Scan on ptab_00002 > -> Foreign Scan on ptab_00003 > -> Foreign Scan on ptab_00004 > -> Foreign Scan on ptab_00005 > -> Foreign Scan on ptab_00006 > -> Foreign Scan on ptab_00007 > -> Foreign Scan on ptab_00008 > -> Foreign Scan on ptab_00009 > -> Foreign Scan on ptab_00010 > <snip> > > The snipped part contains Foreign Scans on 90 more foreign partitions (in > fact, I could see the crash even with 10 foreign table partitions for the > same query). Yeah, it seems to me unrelated to how many they are. > There is a crash in one more case, which seems related to how WaitEventSet > objects are manipulated during resource-owner-mediated cleanup of a failed > query, such as after the FDW returned an error like below: > > ERROR: relation "public.ptab_00010" does not exist > CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010 > > The backtrace in this looks like below: > > Program terminated with signal SIGSEGV, Segmentation fault. > #0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, > value=20645152) at resowner.c:301 > 301 lastidx = resarr->lastidx; > (gdb) > (gdb) bt > #0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, > value=20645152) at resowner.c:301 > #1 0x00000000009c6578 in ResourceOwnerForgetWES > (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317 > #2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 > #3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768, > phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') > at resowner.c:566 > #4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768, > phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 > '\001') at resowner.c:485 > #5 0x0000000000524172 in AbortTransaction () at xact.c:2588 > #6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016 > #7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08, > dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860 > #8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310 > #9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982 > #10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722 > #11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at > postmaster.c:1330 > #12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228 > > There is a segfault when accessing the events variable, whose members seem > to be pfreed: > > (gdb) f 2 > #2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 > 600 ResourceOwnerForgetWES(set->resowner, set); > (gdb) p *set > $5 = { > nevents = 2139062143, > nevents_space = 2139062143, > resowner = 0x7f7f7f7f7f7f7f7f, > events = 0x7f7f7f7f7f7f7f7f, > latch = 0x7f7f7f7f7f7f7f7f, > latch_pos = 2139062143, > epoll_fd = 2139062143, > epoll_ret_events = 0x7f7f7f7f7f7f7f7f > } Mmm, I reproduces it quite easily. A silly bug. Something bad is happening between freeing ExecutorState memory context and resource owner. Perhaps the ExecutorState is freed by resowner (as a part of its anscestors) before the memory for the WaitEventSet is freed. It was careless of me. I'll reconsider it. Great thanks for the report. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers