Re: Parallel INSERT (INTO ... SELECT ...)

Amit Kapila Fri, 19 Feb 2021 03:09:13 -0800

On Fri, Feb 19, 2021 at 10:13 AM tsunakawa.ta...@fujitsu.com
<tsunakawa.ta...@fujitsu.com> wrote:
>
> From: Greg Nancarrow <gregn4...@gmail.com>
> --------------------------------------------------
> On Mon, Jan 25, 2021 at 10:23 AM tsunakawa.ta...@fujitsu.com 
> <tsunakawa.ta...@fujitsu.com> wrote:
> > (8)
> > +               /*
> > +                * If the trigger type is RI_TRIGGER_FK, this indicates a 
> > FK exists in
> > +                * the relation, and this would result in creation of new 
> > CommandIds
> > +                * on insert/update/delete and this isn't supported in a 
> > parallel
> > +                * worker (but is safe in the parallel leader).
> > +                */
> > +               trigtype = RI_FKey_trigger_type(trigger->tgfoid);
> > +               if (trigtype == RI_TRIGGER_FK)
> > +               {
> > +                       if 
> > (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> > +                               return true;
> > +               }
> >
> > Here, RI_TRIGGER_FK should instead be RI_TRIGGER_PK, because RI_TRIGGER_FK 
> > triggers do not generate command IDs.  See RI_FKey_check() which is called 
> > in RI_TRIGGER_FK case.  In there, ri_PerformCheck() is called with the 
> > detectNewRows argument set to false, which causes CommandCounterIncrement() 
> > to not be called.
> >
>
> Hmmm, I'm not sure that you have read and interpreted the patch code 
> correctly.
> The existence of a RI_TRIGGER_FK trigger indicates the table has a foreign 
> key, and an insert into such a table will generate a new commandId (so we 
> must avoid that, as we don't currently have the technology to support sharing 
> of new command IDs across the participants in the parallel operation). This 
> is what the code comment says, It does not say that such a trigger generates 
> a new command ID.
>
> See Amit's updated comment here: 
> https://github.com/postgres/postgres/commit/0d32511eca5aec205cb6b609638ea67129ef6665
>
> In addition, the 2nd patch has an explicit test case for this (testing insert 
> into a table that has a FK).
> --------------------------------------------------
>
>
> First of all, I anticipate this parallel INSERT SELECT feature will typically 
> shine, and expected to work, in the ETL or ELT into a data warehouse or an 
> ODS for analytics.  Bearing that in mind, let me list some issues or 
> questions below.  But the current state of the patch would be of course 
> attractive in some workloads, so I don't think these are not necessarily 
> blockers.
>
>
> (1)
> According to the classic book "The Data Warehouse Toolkit" and the website 
> [1] by its author, the fact table (large transaction history) in the data 
> warehouse has foreign keys referencing to the dimension tables (small or 
> medium-sized master or reference data).  So, parallel insert will be 
> effective if it works when loading data into the fact table with foreign keys.
>
> To answer the above question, I'm assuming:
>
> CREATE TABLE some_dimension (key_col int PRIMARY KEY);
> CREATE TABLE some_fact (some_key int REFERENCES some_dimension);
> INSERT INTO some_fact SELECT ...;
>
>
> My naive question is, "why should new command IDs be generated to check 
> foreign key constraints in this INSERT case?  The check just reads the parent 
> (some_dimension table here)..."
>


It is quite possible what you are saying is correct but I feel that is
not this patch's fault. So, won't it better to discuss this in a
separate thread?

>
>
> (2)
> Likewise, dimension tables have surrogate keys that are typically implemented 
> as a sequence or an identity column.  It is suggested that even fact tables 
> sometimes (or often?) have surrogate keys.  But the current patch does not 
> parallelize the statement when the target table has a sequence or an identity 
> column.
>
> I was looking at the sequence code, and my naive (again) idea is that the 
> parallel leader and workers allocates numbers from the sequence 
> independently, and sets the largest number of them as the session's currval 
> at the end of parallel operation.  We have to note in the documentation that 
> gaps in the sequence numbers will arise and not used in parallel DML.
>

Good use case but again, I think this can be done as a separate patch.

>
> (3)
> As Hou-san demonstrated, the current patch causes the resulting table and 
> index to become larger when inserted in parallel than in inserted serially.  
> This could be a problem for analytics use cases where the table is just 
> inserted and read only afterwards.  We could advise the user to run REINDEX 
> CONCURRENTLY after loading data, but what about tables?
>

I think here you are talking about the third patch (Parallel Inserts).
I guess if one has run inserts parallelly from psql then also similar
behavior would have been observed. For tables, it might lead to better
results once we have the patch discussed at [1]. Actually, this needs
more investigation.

[1] - 
https://www.postgresql.org/message-id/20200508072545.GA9701%40telsasoft.com

-- 
With Regards,
Amit Kapila.

Re: Parallel INSERT (INTO ... SELECT ...)

Reply via email to