> On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I'm recently working/investigating on the ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > ----------
> > Its concept is quite simple; anybody might have considered it more
> > than once. The ParallelAppend node kicks off background worker
> > processes to execute child nodes in parallel / asynchronously.
> > It intends to improve the performance of scanning large partitioned
> > tables from the standpoint of overall throughput; however, the
> > latency of the first several hundred rows is not in the scope of
> > this project.
> > From the standpoint of technology trends, it primarily tries to
> > utilize multi-core capability within a system, but it also enables
> > expansion to a distributed database environment using the
> > foreign-table inheritance features.
> > Its behavior is very similar to the Funnel node except for several
> > points; thus, we can reuse the infrastructure we have had a
> > long-standing discussion about through the v9.5 development cycle.
>
> Now that I've got more of the parallel infrastructure committed, I've
> been starting to have a little time to think about what we might want
> to do after we get PartialSeqScan committed. I'm positive on doing
> something with the Append node and parallelism, but I'm negative on
> doing what you've described here.
>
> I don't think the Append node has any business launching workers.
> That's the job of Gather. Append may need to do something
> parallel-aware, but starting workers is not that thing. Without
> making any changes at all to Append, we can use it like this:
>
> Gather
> -> Append
>   -> Partial Seq Scan on p1
>   -> Partial Seq Scan on p2
>   -> Partial Seq Scan on p3
>
> The Gather node will spin up workers and in each worker we will ask
> the Append nodes for tuples. Append will ask the first
> not-yet-completed child for tuples, so the workers will cooperatively
> scan first p1, then p2, then p3. This is great: instead of merely
> doing a parallel seq scan of a single table, we can do a parallel seq
> scan of a partitioned table. However, there are two improvements we
> can make. First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
> Each new worker should join a subplan with the minimum number of
> workers, work on that one until it's completely done, and then pick
> a new subplan. This minimizes contention. Second, we can teach the
> Append to serve as a parent not only for intrinsically parallel
> nodes like Partial Seq Scan, but also for other nodes, say, an Index
> Scan. When an Append is running in parallel but with
> non-parallel-aware children, each such child can be claimed by at
> most one worker process and will be run to completion by that worker
> process. For example:
>
> Gather
> -> Append
>   -> Index Scan on p1
>   -> Partial Seq Scan on p2
>   -> Index Scan on p3
>
> The first worker which executes the Append should begin the index
> scan on p1 and the second should begin the index scan on p3. The
> remaining workers, and those two once they're finished, can work on
> p2.
>
> Proceeding in this way, I don't think we need a separate Parallel
> Append node. Rather, we can just give the existing Append node some
> extra smarts that are used only when it's running in parallel.
>
I entirely agree with your suggestion.
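Just to confirm my understanding of the bookkeeping you describe,
below is a minimal sketch of how the parallel-aware Append might pick
its next sub-plan from that shared array. This is only an
illustration; ParallelAppendState and choose_next_subplan are
hypothetical names of mine, not existing executor APIs.

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical per-Append state, placed in the DSM segment by Gather */
    typedef struct ParallelAppendState
    {
        slock_t     mutex;      /* protects nworkers[] */
        int         nplans;     /* number of child sub-plans */
        /* workers attached to each child; -1 once a child has completed */
        int         nworkers[FLEXIBLE_ARRAY_MEMBER];
    } ParallelAppendState;

    /*
     * Pick the not-yet-completed sub-plan with the fewest attached
     * workers, to minimize contention; returns -1 once every child
     * has been run to completion.  A non-parallel-aware child would
     * additionally be skipped here after any worker has claimed it.
     */
    static int
    choose_next_subplan(ParallelAppendState *pastate)
    {
        int         best = -1;
        int         i;

        SpinLockAcquire(&pastate->mutex);
        for (i = 0; i < pastate->nplans; i++)
        {
            if (pastate->nworkers[i] < 0)
                continue;       /* this child is already done */
            if (best < 0 || pastate->nworkers[i] < pastate->nworkers[best])
                best = i;
        }
        if (best >= 0)
            pastate->nworkers[best]++;      /* claim it */
        SpinLockRelease(&pastate->mutex);

        return best;
    }

A worker that drains its current sub-plan would set that slot to -1
and call this again; a worker that merely detaches would decrement it.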
We may be able to draw an analogy between PartialSeqScan and the
parallel-aware Append node.
PartialSeqScan fetches the blocks pointed to by an index kept in a
shared memory segment, so multiple workers eventually cooperate to
scan a table in a round-robin manner, even though each individual
worker fetches comb-shaped (interleaved) sets of blocks.
If we regard individual blocks as individual sub-plans of the
parallel-aware Append, it behaves very similarly. A certain number of
workers (more than zero) are launched by the Gather node, then the
parallel-aware Append node in each worker fetches one of the
sub-plans, if any remain.

I think this design also gives additional flexibility according to
the parallelism required by the underlying sub-plans.
Suppose the "PartialSeqScan on p2" in the above example wants 3
workers to process its scan; we can then expand the virtual array of
the sub-plans as follows. If the Gather node kicks off 5 workers,
each worker is assigned to one of the entries. If the Gather node can
kick off fewer than 5 workers, the first worker to exit its sub-plan
picks up the next unclaimed entry, which eventually provides the best
available parallelism. (A rough sketch of how this array might be
built appears further below.)

  +----------+
  | sub-plan |       * Sub-Plan 1 ... Index Scan on p1
  | index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
  | shared   |       * Sub-Plan 2 ... PartialSeqScan on p2
  | memory   |       * Sub-Plan 2 ... PartialSeqScan on p2
  +----------+       * Sub-Plan 3 ... Index Scan on p3

It does not matter even if the Append node has multiple
parallel-aware sub-plans. When a "PartialSeqScan on p4" is added, all
we need to do is expand the above virtual array of the sub-plans.

For more generic plan construction, the Plan node may have a field
for the number of "desirable" workers, even though most plan nodes
are not parallel-aware and the requested number is not guaranteed.
In the above case, the parallel-aware Append will want 5 workers in
total (2 for the two index-scans, plus 3 for the partial-seq-scan).
How many workers are actually launched is at the discretion of the
Gather node; however, this field gives it clear information about how
many workers would be best.

> First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
>
Can the parallel-aware Append recognize the current mode by checking
whether MyBgworkerEntry is valid or not?

> We can also push other things in between the Gather and the Append,
> which wouldn't be possible in your design. For example, consider a
> join between a partitioned table p and an unpartitioned table q. We
> could do this:
>
> Gather
> -> Nested Loop
>   -> Append
>     -> Index Scan on p1
>     -> Partial Seq Scan on p2
>     -> Index Scan on p3
>   -> Index Scan on q
>        Index Cond: q.x = p.x
>
> That's a pretty sweet plan. Assuming p1, p2, and p3 are all
> reasonably large, we could probably benefit from throwing at least 3
> processes at this plan tree - maybe more, if p2 is really big. Only
> one process can work on each of p1 and p3, but p2, since it has a
> truly parallel plan, can soak up as many as we want to throw at it
> (although performance may top out at some point if we're I/O-bound).
>
We can also leverage this parallel capability for hash-joins and
merge-joins (even though it does not cover all the cases; it works
when the Append node runs on a partitioned table with CHECK()
constraints). That is why my colleague is working on the feature to
push the join down below the Append:
http://www.postgresql.org/message-id/12a9442fbae80d4e8953883e0b84e088606...@bpxm01gp.gisp.nec.co.jp
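Coming back to the virtual sub-plan array above, here is the rough
sketch I mentioned, just to make the idea concrete. It is only an
illustration under my assumptions: num_workers as the proposed new
Plan field, and build_virtual_subplan_array as a made-up name, not
existing code.

    #include "postgres.h"
    #include "nodes/plannodes.h"

    /*
     * Flatten the sub-plans into a virtual array of child indexes,
     * repeating a parallel-aware child once per "desirable" worker.
     * num_workers is the assumed new Plan field discussed above,
     * taken as 1 for non-parallel-aware children.
     */
    static int *
    build_virtual_subplan_array(List *subplans, int *array_len)
    {
        int        *vplans;
        int         nentries = 0;
        int         childno = 0;
        ListCell   *lc;

        /* count the virtual entries first */
        foreach(lc, subplans)
            nentries += Max(1, ((Plan *) lfirst(lc))->num_workers);

        vplans = (int *) palloc(sizeof(int) * nentries);

        /* then fill them in, duplicating parallel-aware children */
        nentries = 0;
        foreach(lc, subplans)
        {
            int     ndup = Max(1, ((Plan *) lfirst(lc))->num_workers);

            while (ndup-- > 0)
                vplans[nentries++] = childno;
            childno++;
        }

        *array_len = nentries;
        return vplans;
    }

With num_workers = {1, 3, 1} for the three children in the diagram,
this yields the five entries {0, 1, 1, 1, 2}, and their total (5) is
exactly the number of workers the Append would report as desirable to
the Gather node.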
> Sorry for taking so long to give a really substantive reply on this,
> but it wasn't until this last week or so that I really had time to
> think about this in detail.
>
No worry; I also couldn't make much progress due to various jobs, not
only community work...

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers