On Mon, Sep 28, 2020 at 3:58 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> Thanks Andres for the comments.
>
> On Thu, Sep 24, 2020 at 8:11 AM Andres Freund <and...@anarazel.de> wrote:
> >
> > > The design:
> >
> > I think it'd be good if you could explain a bit more why you think this
> > is safe to do in the way you have done it.
> >
> > E.g. from a quick scroll through the patch, there's not even a comment
> > explaining that the only reason there doesn't need to be code dealing
> > with xid assignment is that we already did the catalog changes to
> > create the table.
> >
>
> Yes, we do a bunch of catalog changes related to the newly created
> table. We will have both the txn id and the command id assigned when
> the catalog changes are made. But right after the table is created in
> the leader, the command id is incremented (CommandCounterIncrement()
> is called from create_ctas_internal()) whereas the txn id remains the
> same. The new command id is marked as used via GetCurrentCommandId(true)
> in intorel_startup, and then parallel mode is entered. The txn id and
> command id are serialized into the parallel DSM, so they are then
> available to all parallel workers. This is discussed in [1].
>
> A few changes I have to make in the parallel worker code: set
> currentCommandIdUsed = true, maybe via a common API
> SetCurrentCommandIdUsedForWorker() proposed in [1], and remove the
> extra command id sharing from the leader to the workers.
>
> I will add a few comments in the upcoming patch related to the above
> info.
>

Yes, that would be good.
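To make sure we are talking about the same worker-side change: I assume
SetCurrentCommandIdUsedForWorker() (only the name is proposed in [1]; the
body below is my guess, since currentCommandIdUsed is static in xact.c
and the helper would presumably live there) would be a tiny exported
setter along these lines:

/*
 * Sketch only -- the name comes from the proposal in [1] and the body
 * is an assumption.  A parallel worker inserts tuples with the command
 * id it received from the leader, so it needs an exported way to mark
 * that command id as used; currentCommandIdUsed is static in xact.c.
 */
void
SetCurrentCommandIdUsedForWorker(void)
{
	Assert(IsParallelWorker() && !currentCommandIdUsed);

	currentCommandIdUsed = true;
}

with the worker calling it once, after restoring the transaction state
it got from the leader and before inserting any tuples into the target
table.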
> >
> > But how does that work for SELECT INTO? Are you prohibiting
> > that? ...
>
> In case of SELECT INTO, a new table gets created and I'm not
> prohibiting the parallel inserts and I think we don't need to.
>

So, in this case also, do we ensure that the table is created before we
launch the workers? If so, I think you can explain that in the
comments, along with what needs to be done to ensure it.

While skimming through the patch, a small thing I noticed:

+	/*
+	 * SELECT part of the CTAS is parallelizable, so we can make
+	 * each parallel worker insert the tuples that are resulted
+	 * in it's execution into the target table.
+	 */
+	if (!is_matview &&
+		IsA(plan->planTree, Gather))
+		((DR_intorel *) dest)->is_parallel = true;
+

I am not sure at this stage if this is the best way to make CTAS
parallel, but if so, then you can probably expand the comment a bit to
say why you consider only the Gather node (and that too only when it is
the top-most node) and why not another parallel node like GatherMerge.

> Thoughts?
>
> > > Below things are still pending. Thoughts are most welcome:
> > > 1. How better we can lift the "cannot insert tuples in a parallel worker"
> > > from heap_prepare_insert() for only CTAS cases or for that matter parallel
> > > copy? How about having a variable in any of the worker global contexts and
> > > use that? Of course, we can remove this restriction entirely in case we
> > > fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.
> >
> > And for the purpose of your question, we could then have a
> > table_insert_allow_parallel(TableInsertScan *);
> > or an additional arg to table_begin_insert().
> >
>
> Removing the "cannot insert tuples in a parallel worker" restriction
> from heap_prepare_insert() is a common problem for parallel inserts in
> general, i.e. parallel inserts in CTAS, parallel INSERT INTO
> SELECT [1], and parallel COPY [2]. It will be good if a common solution
> is agreed upon.
>

Right. For now, I think you can simply remove that check from the code
instead of just commenting it out. We will see if there is a better
check/Assert we can add there.
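To avoid any confusion about which check I mean, it is the ereport in
heap_prepare_insert() (reproduced below from memory, so the comment is
mine and the exact wording in heapam.c may differ slightly):

/*
 * Today, heap_prepare_insert() flatly rejects inserts from within a
 * parallel worker.  The idea for now is to remove this check rather
 * than special-case CTAS/COPY, and revisit a better check or Assert
 * later.
 */
if (IsParallelWorker())
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
			 errmsg("cannot insert tuples in a parallel worker")));

--
With Regards,
Amit Kapila.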