> On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I'm recently working/investigating on the ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > ----------
> > Its concept is quite simple; anybody might have considered it more
> > than once. The ParallelAppend node kicks off background worker
> > processes to execute child nodes in parallel / asynchronously.
> > It intends to improve the performance of scanning large partitioned
> > tables from the standpoint of overall throughput; however, the
> > latency of the first several hundred rows is not in the scope of
> > this project.
> > From the standpoint of technology trends, it primarily tries to
> > utilize multi-core capability within a system, but it also enables
> > expansion to a distributed database environment using the
> > foreign-table inheritance features.
> > Its behavior is very similar to the Funnel node except for several
> > points; thus, we can reuse the infrastructure we have had a
> > long-standing discussion about through the v9.5 development cycle.
>
> Now that I've got more of the parallel infrastructure committed, I've
> been starting to have a little time to think about what we might want
> to do after we get PartialSeqScan committed. I'm positive on doing
> something with the Append node and parallelism, but I'm negative on
> doing what you've described here.
>
> I don't think the Append node has any business launching workers.
> That's the job of Gather. Append may need to do something
> parallel-aware, but starting workers is not that thing. Without
> making any changes at all to Append, we can use it like this:
>
> Gather
> -> Append
>   -> Partial Seq Scan on p1
>   -> Partial Seq Scan on p2
>   -> Partial Seq Scan on p3
>
> The Gather node will spin up workers and in each worker we will ask
> the Append nodes for tuples. Append will ask the first
> not-yet-completed child for tuples, so the workers will cooperatively
> scan first p1, then p2, then p3. This is great: instead of merely
> doing a parallel seq scan of a single table, we can do a parallel seq
> scan of a partitioned table. However, there are two improvements we
> can make. First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
> Each new worker should join a subplan with the minimum number of
> workers, work on that one until it's completely done, and then pick
> a new subplan. This minimizes contention. Second, we can teach the
> Append to serve as a parent not only for intrinsically parallel
> nodes like Partial Seq Scan, but also for other nodes, say, an Index
> Scan. When an Append is running in parallel but with
> non-parallel-aware children, each such child can be claimed by at
> most one worker process and will be run to completion by that worker
> process. For example:
>
> Gather
> -> Append
>   -> Index Scan on p1
>   -> Partial Seq Scan on p2
>   -> Index Scan on p3
>
> The first worker which executes the Append should begin the index
> scan on p1 and the second should begin the index scan on p3. The
> remaining workers, and those two once they're finished, can work on
> p2.
>
> Proceeding in this way, I don't think we need a separate Parallel
> Append node. Rather, we can just give the existing Append node some
> extra smarts that are used only when it's running in parallel.
>
I entirely agree with your suggestion.
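Just to confirm my understanding of the bookkeeping you describe,
below is a minimal sketch of how the parallel-aware Append might pick
its next sub-plan from that shared array. This is only an
illustration; ParallelAppendState and choose_next_subplan are
hypothetical names of mine, not existing executor APIs.

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical per-Append state, placed in the DSM segment by Gather */
    typedef struct ParallelAppendState
    {
        slock_t     mutex;      /* protects nworkers[] */
        int         nplans;     /* number of child sub-plans */
        /* workers attached to each child; -1 once a child has completed */
        int         nworkers[FLEXIBLE_ARRAY_MEMBER];
    } ParallelAppendState;

    /*
     * Pick the not-yet-completed sub-plan with the fewest attached
     * workers, to minimize contention; returns -1 once every child
     * has been run to completion.  A non-parallel-aware child would
     * additionally be skipped here after any worker has claimed it.
     */
    static int
    choose_next_subplan(ParallelAppendState *pastate)
    {
        int         best = -1;
        int         i;

        SpinLockAcquire(&pastate->mutex);
        for (i = 0; i < pastate->nplans; i++)
        {
            if (pastate->nworkers[i] < 0)
                continue;       /* this child is already done */
            if (best < 0 || pastate->nworkers[i] < pastate->nworkers[best])
                best = i;
        }
        if (best >= 0)
            pastate->nworkers[best]++;      /* claim it */
        SpinLockRelease(&pastate->mutex);

        return best;
    }

A worker that drains its current sub-plan would set that slot to -1
and call this again; a worker that merely detaches would decrement it.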
We may be able to draw an analogy between PartialSeqScan and the
parallel-aware Append node.
PartialSeqScan fetches the blocks pointed to by an index kept in a
shared memory segment, so multiple workers eventually cooperate to
scan a table in a round-robin manner, even though each individual
worker fetches comb-shaped (interleaved) sets of blocks.
If we regard individual blocks as individual sub-plans of the
parallel-aware Append, it behaves very similarly. A certain number of
workers (more than zero) are launched by the Gather node, then the
parallel-aware Append node in each worker fetches one of the
sub-plans, if any remain.

I think this design also gives additional flexibility according to
the parallelism required by the underlying sub-plans.
Suppose the "PartialSeqScan on p2" in the above example wants 3
workers to process its scan; we can then expand the virtual array of
the sub-plans as follows. If the Gather node kicks off 5 workers,
each worker is assigned to one of the entries. If the Gather node can
kick off fewer than 5 workers, the first worker to exit its sub-plan
picks up the next unclaimed entry, which eventually provides the best
available parallelism. (A rough sketch of how this array might be
built appears further below.)

  +----------+
  | sub-plan |       * Sub-Plan 1 ... Index Scan on p1
  | index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
  | shared   |       * Sub-Plan 2 ... PartialSeqScan on p2
  | memory   |       * Sub-Plan 2 ... PartialSeqScan on p2
  +----------+       * Sub-Plan 3 ... Index Scan on p3

It does not matter even if the Append node has multiple
parallel-aware sub-plans. When a "PartialSeqScan on p4" is added, all
we need to do is expand the above virtual array of the sub-plans.

For more generic plan construction, the Plan node may have a field
for the number of "desirable" workers, even though most plan nodes
are not parallel-aware and the requested number is not guaranteed.
In the above case, the parallel-aware Append will want 5 workers in
total (2 for the two index-scans, plus 3 for the partial-seq-scan).
How many workers are actually launched is at the discretion of the
Gather node; however, this field gives it clear information about how
many workers would be best.

> First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
>
Can the parallel-aware Append recognize the current mode by checking
whether MyBgworkerEntry is valid or not?

> We can also push other things in between the Gather and the Append,
> which wouldn't be possible in your design. For example, consider a
> join between a partitioned table p and an unpartitioned table q. We
> could do this:
>
> Gather
> -> Nested Loop
>   -> Append
>     -> Index Scan on p1
>     -> Partial Seq Scan on p2
>     -> Index Scan on p3
>   -> Index Scan on q
>        Index Cond: q.x = p.x
>
> That's a pretty sweet plan. Assuming p1, p2, and p3 are all
> reasonably large, we could probably benefit from throwing at least 3
> processes at this plan tree - maybe more, if p2 is really big. Only
> one process can work on each of p1 and p3, but p2, since it has a
> truly parallel plan, can soak up as many as we want to throw at it
> (although performance may top out at some point if we're I/O-bound).
>
We can also leverage this parallel capability for hash-joins and
merge-joins (even though it does not cover all the cases; it works
when the Append node runs on a partitioned table with CHECK()
constraints). That is why my colleague is working on the feature to
push the join down below the Append:
http://www.postgresql.org/message-id/12a9442fbae80d4e8953883e0b84e088606...@bpxm01gp.gisp.nec.co.jp
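Coming back to the virtual sub-plan array above, here is the rough
sketch I mentioned, just to make the idea concrete. It is only an
illustration under my assumptions: num_workers as the proposed new
Plan field, and build_virtual_subplan_array as a made-up name, not
existing code.

    #include "postgres.h"
    #include "nodes/plannodes.h"

    /*
     * Flatten the sub-plans into a virtual array of child indexes,
     * repeating a parallel-aware child once per "desirable" worker.
     * num_workers is the assumed new Plan field discussed above,
     * taken as 1 for non-parallel-aware children.
     */
    static int *
    build_virtual_subplan_array(List *subplans, int *array_len)
    {
        int        *vplans;
        int         nentries = 0;
        int         childno = 0;
        ListCell   *lc;

        /* count the virtual entries first */
        foreach(lc, subplans)
            nentries += Max(1, ((Plan *) lfirst(lc))->num_workers);

        vplans = (int *) palloc(sizeof(int) * nentries);

        /* then fill them in, duplicating parallel-aware children */
        nentries = 0;
        foreach(lc, subplans)
        {
            int     ndup = Max(1, ((Plan *) lfirst(lc))->num_workers);

            while (ndup-- > 0)
                vplans[nentries++] = childno;
            childno++;
        }

        *array_len = nentries;
        return vplans;
    }

With num_workers = {1, 3, 1} for the three children in the diagram,
this yields the five entries {0, 1, 1, 1, 2}, and their total (5) is
exactly the number of workers the Append would report as desirable to
the Gather node.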
> Sorry for taking so long to give a really substantive reply on this,
> but it wasn't until this last week or so that I really had time to
> think about this in detail.
>
No worry; I also couldn't make much progress due to various jobs, not
only community work...

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers