On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi <horikyota....@gmail.com> wrote: > - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch > The async feature uses WaitEvent, and it needs to be released on > error. This patch makes it possible to register WaitEvent to > resowner to handle that case..
+1 > - v2-0002-infrastructure-for-asynchronous-execution.patch > It povides an abstraction layer of asynchronous behavior > (execAsync). Then adds ExecAppend, another version of ExecAppend, > that handles "async-capable" subnodes asynchronously. Also it > contains planner part that makes planner aware of "async-capable" > and "async-aware" path nodes. > This patch add an infrastructure for asynchronous execution. As a PoC > this makes only Append capable to handle asynchronously executable > subnodes. What other nodes do you think could be async aware? I suppose you would teach joins to pass through the async support of their children, and then you could make partition-wise join work like that. + /* choose appropriate version of Exec function */ + if (appendstate->as_nasyncplans == 0) + appendstate->ps.ExecProcNode = ExecAppend; + else + appendstate->ps.ExecProcNode = ExecAppendAsync; Cool. No extra cost for people not using the new feature. + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) So, now when you execute a node, you get a result AND you get some information that you access by reaching into the child node's PlanState. The ExecProcNode() interface is extremely limiting, but I'm not sure if this is the right way to extend it. Maybe ExecAsyncProcNode() with a wide enough interface to do the job? +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; ... + if (refindsize < n) ... + static ExecAsync_mcbarg mcb_arg = + { &refind, &refindsize }; + static MemoryContextCallback mcb = + { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL }; ... + MemoryContextRegisterResetCallback(TopTransactionContext, &mcb); This seems a bit strange. Why not just put the pointer in the plan state? I suppose you want to avoid allocating a new buffer for every query. Perhaps you could fix that by having a small fixed-size buffer in the PlanState to cover common cases and allocating a larger one in a per-query memory context if that one is too small, or just not worrying about it and allocating every time since you know the desired size. + wes = CreateWaitEventSet(TopTransactionContext, TopTransactionResourceOwner, n); ... + FreeWaitEventSet(wes); BTW, just as an FYI, I am proposing[1] to add support for RemoveWaitEvent(), so that you could have a single WaitEventSet for the lifetime of the executor node, and then add and remove sockets only as needed. I'm hoping to commit that for PG13, if there are no objections or better ideas soon, because it's useful for some other places where we currently create and destroy WaitEventSets frequently. One complication when working with long-lived WaitEventSet objects is that libpq (or some other thing used by some other hypothetical async-capable FDW) is free to close and reopen its socket whenever it wants, so you need a way to know when it has done that. In that patch set I added pqSocketChangeCount() so that you can see when pgSocket() refers to a new socket (even if the file descriptor number is the same by coincidence), but that imposes some book-keeping duties on the caller, and now I'm wondering how that would look in your patch set. My goal is to generate the minimum number of systems calls. I think it would be nice if a 1000-shard query only calls epoll_ctl() when a child node needs to be added or removed from the set, not epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every wait. But I suppose there is an argument that it's more complication than it's worth. [1] https://commitfest.postgresql.org/27/2452/