Re: [HACKERS] asynchronous execution

Kyotaro HORIGUCHI Thu, 16 Feb 2017 04:07:49 -0800

Thank you very much for testing this!

At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <[email protected]> 
wrote in <[email protected]>
> Horiguchi-san,
> 
> On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
> 
> With the latest set of patches, I observe a crash due to an Assert failure:
> 
> #3  0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
> timeout=-1) at execAsync.c:345


This means that none of the pending FDW scans moved itself to the
waiting stage, which leaves the whole thing stuck. This happens when
no one is actually waiting for a result. I suppose that all of the
foreign scans ran on the same connection. In any case, it should be a
mistake in state transition. I'll look into it.
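
To illustrate the condition described above, here is a minimal sketch
in C. It is not code from the patch: the state names and both
functions are hypothetical, and it only models the assumption that
blocking with timeout = -1 is safe only if at least one request has
registered itself as waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request states; the names in the real patch may differ. */
typedef enum
{
    REQ_IDLE,       /* not started, or already finished */
    REQ_PENDING,    /* wants to run but has not registered a wait event */
    REQ_WAITING     /* has registered an event and waits for a result */
} ReqState;

/* Returns true if at least one request has moved to the waiting stage. */
static bool
any_request_waiting(const ReqState *states, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (states[i] == REQ_WAITING)
            return true;
    }
    return false;
}

/*
 * Model of the check before blocking.  With a negative timeout (wait
 * forever) and nobody waiting, the wait could never return, so this
 * reports that blocking is unsafe.  A finite timeout always returns.
 */
static bool
safe_to_block(const ReqState *states, size_t n, long timeout)
{
    if (timeout < 0 && !any_request_waiting(states, n))
        return false;
    return true;
}
```

In this model, a set of requests that are all pending or idle fails
the check when timeout is -1, which corresponds to the stuck state
described above.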

> I was running a query whose plan looked like:
> 
> explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
> group by 1,2 order by 1;
>                       QUERY PLAN
> ------------------------------------------------------
>  Sort
>    Sort Key: ((ptab.tableoid)::regclass)
>    ->  HashAggregate
>          Group Key: (ptab.tableoid)::regclass, ptab.a
>          ->  Result
>                ->  Append
>                      ->  Foreign Scan on ptab_00001
>                      ->  Foreign Scan on ptab_00002
>                      ->  Foreign Scan on ptab_00003
>                      ->  Foreign Scan on ptab_00004
>                      ->  Foreign Scan on ptab_00005
>                      ->  Foreign Scan on ptab_00006
>                      ->  Foreign Scan on ptab_00007
>                      ->  Foreign Scan on ptab_00008
>                      ->  Foreign Scan on ptab_00009
>                      ->  Foreign Scan on ptab_00010
>                <snip>
> 
> The snipped part contains Foreign Scans on 90 more foreign partitions (in
> fact, I could see the crash even with 10 foreign table partitions for the
> same query).

Yeah, it seems to me unrelated to how many there are.

> There is a crash in one more case, which seems related to how WaitEventSet
> objects are manipulated during resource-owner-mediated cleanup of a failed
> query, such as after the FDW returned an error like below:
> 
> ERROR:  relation "public.ptab_00010" does not exist
> CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010
> 
> The backtrace in this looks like below:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> 301                                   lastidx = resarr->lastidx;
> (gdb)
> (gdb) bt
> #0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> #1  0x00000000009c6578 in ResourceOwnerForgetWES
> (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
> #2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> #3  0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
>     at resowner.c:566
> #4  0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
> '\001') at resowner.c:485
> #5  0x0000000000524172 in AbortTransaction () at xact.c:2588
> #6  0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
> #7  0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
> dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
> #8  0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
> #9  0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
> #10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
> #11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
> postmaster.c:1330
> #12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
> 
> There is a segfault when accessing the events variable, whose members seem
> to be pfreed:
> 
> (gdb) f 2
> #2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> 600                   ResourceOwnerForgetWES(set->resowner, set);
> (gdb) p *set
> $5 = {
>   nevents = 2139062143,
>   nevents_space = 2139062143,
>   resowner = 0x7f7f7f7f7f7f7f7f,
>   events = 0x7f7f7f7f7f7f7f7f,
>   latch = 0x7f7f7f7f7f7f7f7f,
>   latch_pos = 2139062143,
>   epoll_fd = 2139062143,
>   epoll_ret_events = 0x7f7f7f7f7f7f7f7f
> }

Mmm, I reproduced it quite easily. A silly bug.

Something is going wrong in the ordering between freeing the
ExecutorState memory context and releasing the resource owner.
Perhaps the ExecutorState is freed by the resowner (as a part of its
ancestors) before the memory for the WaitEventSet is freed. It was
careless of me. I'll reconsider it.
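
The values in the quoted gdb output fit this reading: 2139062143 is
0x7F7F7F7F, and assert-enabled PostgreSQL builds overwrite freed
memory with 0x7F bytes (CLOBBER_FREED_MEMORY). The following sketch
in C shows how such a pattern arises and how to recognize it. FakeSet
and fake_free are hypothetical stand-ins, not the real WaitEventSet
or pfree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Byte used to overwrite freed memory in this model. */
#define CLOBBER_BYTE 0x7F

/* Returns true if the object is non-empty and every byte is CLOBBER_BYTE. */
static bool
looks_clobbered(const void *p, size_t len)
{
    const unsigned char *b = p;

    for (size_t i = 0; i < len; i++)
    {
        if (b[i] != CLOBBER_BYTE)
            return false;
    }
    return len > 0;
}

/* Simplified, hypothetical stand-in for a structure that gets freed. */
typedef struct
{
    int   nevents;
    void *resowner;
} FakeSet;

/* Models a debugging free: wipes the whole object, but releases nothing. */
static void
fake_free(FakeSet *set)
{
    memset(set, CLOBBER_BYTE, sizeof(*set));
}
```

After fake_free, every member reads as a run of 0x7F bytes, which
matches the shape of the dump: an int member shows 2139062143 (with a
32-bit int) and a pointer member shows 0x7f7f7f7f7f7f7f7f on a 64-bit
machine.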

Many thanks for the report.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




