On Thu, Apr 22, 2021 at 4:51 PM houzj.f...@fujitsu.com
<houzj.f...@fujitsu.com> wrote:
>
> > BACKGROUND
> > ========================================
> >
> > We want to realize parallel INSERT SELECT in the following steps:
> > 1) INSERT + parallel SELECT
> > 2) Parallel INSERT + parallel SELECT
> >
> > Below are example use cases.  We don't expect high concurrency or an
> > empty data source.
> > * Data loading (ETL or ELT) into an analytics database, typically a
> >   data warehouse.
> > * Batch processing in an OLTP database.
> >
> > 2) Enabling users to declare that the table allows parallel data
> > modification
> > Add a table property that represents the parallel safety of the table
> > for DML statement execution.  Users specify it as follows:
> >
> > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
> > ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };
> >
> > This property is recorded in pg_class's relparallel column as 'u', 'r',
> > or 's', just like pg_proc's proparallel.  The default is UNSAFE.
> >
> > The planner assumes that the table, its descendant partitions, and
> > their ancillary objects all have the specified parallel safety or a
> > safer one.  The user is responsible for its correctness.  If the
> > parallel processes find an object that is less safe than the assumed
> > parallel safety during statement execution, it throws an ERROR and
> > aborts the statement execution.
> >
> > When the parallel safety of some of these objects is changed, it's
> > costly to reflect it on the parallel safety of the tables that depend
> > on them, so we don't do it.  Instead, we provide a utility function
> > pg_get_parallel_safety('table_name') that returns records of (objid,
> > classid, parallel_safety) representing the parallel safety of the
> > objects that determine the parallel safety of the specified table.
> > The function only outputs objects that are not parallel safe;
> > otherwise, it would consume excessive memory while accumulating the
> > output.  The user can use this function to identify problematic
> > objects when a parallel DML fails or is not parallelized in the
> > expected manner.
> >
> > How does the executor detect parallel unsafe objects?  There are two
> > ways:
> >
> > 1) At loading time
> > ...
> >
> > 2) At function execution time
> > All related objects come down to some function execution.  So, add a
> > parallel safety check there when in a parallel worker.  If the current
> > process is a parallel worker and the function is parallel unsafe,
> > error out with ereport(ERROR).  This approach eliminates oversights in
> > the parallel safety check, with the additional bonus of a tiny code
> > change!
> >
> > The place would be FunctionCallInvoke().  It's a macro in fmgr.h now.
> > Perhaps we should make it a function in fmgr.c, so that fmgr.h does
> > not have to include header files for parallelism-related definitions.
> >
> > We have to evaluate the performance effect of converting
> > FunctionCallInvoke() into a function and adding an if statement there,
> > because it's a relatively low-level function.
>
> Based on the above, we plan to move forward with approach 2) (the
> declarative idea).
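
Just to check that I'm reading proposal 2) (the check at function execution
time) correctly, here is a rough sketch of how I picture the check if
FunctionCallInvoke() were converted into a function in fmgr.c.
IsParallelWorker(), func_parallel() and PROPARALLEL_UNSAFE are existing
internals; the placement, the error text and everything else here are only
my illustration, not the actual patch:

/*
 * Sketch only: FunctionCallInvoke() as a function in fmgr.c, with the
 * parallel-safety check added for workers.
 */
#include "postgres.h"

#include "access/parallel.h"
#include "catalog/pg_proc.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/* assume the macro of the same name has been removed from fmgr.h */
#undef FunctionCallInvoke

Datum
FunctionCallInvoke(FunctionCallInfo fcinfo)
{
    /*
     * In a parallel worker, refuse to run a function that the catalog
     * marks as parallel unsafe.  func_parallel() does a syscache lookup on
     * every call, so I guess the real patch would cache proparallel in
     * FmgrInfo at fmgr_info() time instead.
     */
    if (IsParallelWorker() &&
        func_parallel(fcinfo->flinfo->fn_oid) == PROPARALLEL_UNSAFE)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot execute parallel unsafe function with OID %u in a parallel worker",
                        fcinfo->flinfo->fn_oid)));

    return (*fcinfo->flinfo->fn_addr) (fcinfo);
}

If it ends up looking roughly like that, the per-call catalog lookup (or the
cached flag) is probably where the performance effect you mention will show
up, given how hot FunctionCallInvoke() is.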
IIUC, the declarative idea attaches a parallel safe/unsafe/restricted tag to
each table, with UNSAFE as the default.  Does that mean that for a parallel
unsafe table no parallel SELECTs or INSERTs (and maybe UPDATEs) will be
picked, or only no parallel INSERTs?  If both parallel INSERTs and SELECTs
are affected, will existing tables need their parallel safety tags adjusted
when migrating?  Another point: what does it mean for a table to be parallel
restricted, and what should happen when it appears in a query together with
parallel safe tables?

I may be wrong here, but IIUC the main problem we are trying to solve with
the declarative approach is to let the user decide the parallel safety of
partitioned tables, because it may be costly for postgres to determine it,
whereas for normal tables we can perform the parallel safety checks without
incurring much cost.  So, should we restrict the declarative approach to
partitioned tables only?

While reading the design, I came across "erroring out during execution of a
query when a parallel unsafe function is detected".  If I read that
correctly, doesn't it force users to run pg_get_parallel_safety to find the
parallel unsafe objects, mark all of them parallel safe if possible, and
otherwise disable parallelism to run the query?  Isn't that burdensome?
Instead, how about postgres retrying the query when it hits the error from a
parallel unsafe function during execution, i.e. disabling parallelism and
running the query again?  I think this kind of query retry feature can be
built outside of core postgres (a rough sketch is at the end of this mail),
but IMO it would be good to have it inside as well, configurable of course.
IIRC, the Teradata database has a Query Retry feature.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
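
P.S. By "built outside of core postgres" I mean something like the below
(untested libpq sketch): catch the error, turn off parallel plans for the
session, and run the statement once more.  The SQLSTATE "25000" is only my
guess at what the new error would report, and the conninfo and query strings
are placeholders.

#include <stdio.h>
#include <string.h>

#include <libpq-fe.h>

static PGresult *
exec_with_parallel_retry(PGconn *conn, const char *query)
{
    PGresult   *res = PQexec(conn, query);

    if (PQresultStatus(res) == PGRES_FATAL_ERROR)
    {
        const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

        /* assumed error code for "parallel unsafe object hit by a worker" */
        if (sqlstate != NULL && strcmp(sqlstate, "25000") == 0)
        {
            PQclear(res);

            /* disable parallel plans for this session and retry once */
            PQclear(PQexec(conn, "SET max_parallel_workers_per_gather = 0"));
            res = PQexec(conn, query);
        }
    }

    return res;
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* placeholder conninfo */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    res = exec_with_parallel_retry(conn,
                                   "INSERT INTO target_tbl SELECT * FROM source_tbl");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "query failed: %s", PQresultErrorMessage(res));

    PQclear(res);
    PQfinish(conn);
    return 0;
}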