On Mon, Sep 28, 2020 at 3:58 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> Thanks Andres for the comments.
>
> On Thu, Sep 24, 2020 at 8:11 AM Andres Freund <and...@anarazel.de> wrote:
> >
> > > The design:
> >
> > I think it'd be good if you could explain a bit more why you think this
> > is safe to do in the way you have done it.
> >
> > E.g. from a quick scroll through the patch, there's not even a comment
> > explaining that the only reason there doesn't need to be code dealing
> > with xid assignment is that we already did the catalog changes to
> > create the table.
> >
>
> Yes, we do a bunch of catalog changes related to the newly created
> table. We will have both the txn id and the command id assigned when
> the catalog changes are made. But right after the table is created in
> the leader, the command id is incremented (CommandCounterIncrement()
> is called from create_ctas_internal()) whereas the txn id remains the
> same. The new command id is marked as used via GetCurrentCommandId(true)
> in intorel_startup, and then parallel mode is entered. The txn id and
> command id are serialized into the parallel DSM, so they are then
> available to all parallel workers. This is discussed in [1].
>
> A few changes I have to make in the parallel worker code: set
> currentCommandIdUsed = true, maybe via a common API
> SetCurrentCommandIdUsedForWorker() proposed in [1], and remove the
> extra command id sharing from the leader to the workers.
>
> I will add a few comments in the upcoming patch related to the above
> info.
>

Yes, that would be good.
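To make sure we are talking about the same worker-side change: I assume
SetCurrentCommandIdUsedForWorker() (only the name is proposed in [1]; the
body below is my guess, since currentCommandIdUsed is static in xact.c
and the helper would presumably live there) would be a tiny exported
setter along these lines:

/*
 * Sketch only -- the name comes from the proposal in [1] and the body
 * is an assumption.  A parallel worker inserts tuples with the command
 * id it received from the leader, so it needs an exported way to mark
 * that command id as used; currentCommandIdUsed is static in xact.c.
 */
void
SetCurrentCommandIdUsedForWorker(void)
{
	Assert(IsParallelWorker() && !currentCommandIdUsed);

	currentCommandIdUsed = true;
}

with the worker calling it once, after restoring the transaction state
it got from the leader and before inserting any tuples into the target
table.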
> >
> > But how does that work for SELECT INTO? Are you prohibiting
> > that? ...
>
> In case of SELECT INTO, a new table gets created and I'm not
> prohibiting the parallel inserts and I think we don't need to.
>

So, in this case also, do we ensure that the table is created before we
launch the workers? If so, I think you can explain that in the
comments, along with what needs to be done to ensure it.

While skimming through the patch, a small thing I noticed:

+	/*
+	 * SELECT part of the CTAS is parallelizable, so we can make
+	 * each parallel worker insert the tuples that are resulted
+	 * in it's execution into the target table.
+	 */
+	if (!is_matview &&
+		IsA(plan->planTree, Gather))
+		((DR_intorel *) dest)->is_parallel = true;
+

I am not sure at this stage if this is the best way to make CTAS
parallel, but if so, then you can probably expand the comment a bit to
say why you consider only the Gather node (and that too only when it is
the top-most node) and why not another parallel node like GatherMerge.

> Thoughts?
>
> > > Below things are still pending. Thoughts are most welcome:
> > > 1. How better we can lift the "cannot insert tuples in a parallel worker"
> > > from heap_prepare_insert() for only CTAS cases or for that matter parallel
> > > copy? How about having a variable in any of the worker global contexts and
> > > use that? Of course, we can remove this restriction entirely in case we
> > > fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.
> >
> > And for the purpose of your question, we could then have a
> > table_insert_allow_parallel(TableInsertScan *);
> > or an additional arg to table_begin_insert().
> >
>
> Removing the "cannot insert tuples in a parallel worker" restriction
> from heap_prepare_insert() is a common problem for parallel inserts in
> general, i.e. parallel inserts in CTAS, parallel INSERT INTO
> SELECT [1], and parallel COPY [2]. It will be good if a common solution
> is agreed upon.
>

Right. For now, I think you can simply remove that check from the code
instead of just commenting it out. We will see if there is a better
check/Assert we can add there.
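To avoid any confusion about which check I mean, it is the ereport in
heap_prepare_insert() (reproduced below from memory, so the comment is
mine and the exact wording in heapam.c may differ slightly):

/*
 * Today, heap_prepare_insert() flatly rejects inserts from within a
 * parallel worker.  The idea for now is to remove this check rather
 * than special-case CTAS/COPY, and revisit a better check or Assert
 * later.
 */
if (IsParallelWorker())
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
			 errmsg("cannot insert tuples in a parallel worker")));

--
With Regards,
Amit Kapila.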