Volatile Functions in Parallel Plans

Zhenghua Lyu Wed, 15 Jul 2020 05:45:21 -0700

Hi,
    I test some SQL in the latest Postgres master branch code (we find these 
issues when
developing Greenplum database in the PR 
https://github.com/greenplum-db/gpdb/pull/10418,
and my colleague come up with the following cases in Postgres):


create table t3 (c1 text, c2 text);
CREATE TABLE
insert into t3
select
  'fhufehwiohewiuewhuhwiufhwifhweuhfwu', --random data
  'fiowehufwhfyegygfewpfwwfeuhwhufwh' --random data
from generate_series(1, 10000000) i;
INSERT 0 10000000
analyze t3;
ANALYZE
create table t4 (like t3);
CREATE TABLE
insert into t4 select * from t4;
INSERT 0 0
insert into t4 select * from t3;
INSERT 0 10000000
analyze t4;
ANALYZE
set enable_hashjoin to off;
SET
explain (costs off)
select count(*) from t3, t4
where t3.c1 like '%sss'
      and timeofday() = t4.c1 and t3.c1 = t4.c1;
                       QUERY PLAN
--------------------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Nested Loop
                     Join Filter: (t3.c1 = t4.c1)
                     ->  Parallel Seq Scan on t3
                           Filter: (c1 ~~ '%sss'::text)
                     ->  Seq Scan on t4
                           Filter: (timeofday() = c1)
(10 rows)

explain (verbose, costs off)
select count(*)
from
  t3,
  (select *, timeofday() as x from t4 ) t4
where t3.c1 like '%sss' and
      timeofday() = t4.c1 and t3.c1 = t4.c1;
                            QUERY PLAN
------------------------------------------------------------------
 Finalize Aggregate
   Output: count(*)
   ->  Gather
         Output: (PARTIAL count(*))
         Workers Planned: 2
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Nested Loop
                     Join Filter: (t3.c1 = t4.c1)
                     ->  Parallel Seq Scan on public.t3
                           Output: t3.c1, t3.c2
                           Filter: (t3.c1 ~~ '%sss'::text)
                     ->  Seq Scan on public.t4
                           Output: t4.c1, NULL::text, timeofday()
                           Filter: (timeofday() = t4.c1)
(15 rows)


Focus on the last two plans, the function timeofday is
volatile but paralle-safe. And Postgres outputs two parallel
plan.


The first plan:
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Nested Loop
                     Join Filter: (t3.c1 = t4.c1)
                     ->  Parallel Seq Scan on t3
                           Filter: (c1 ~~ '%sss'::text)
                     ->  Seq Scan on t4
                           Filter: (timeofday() = c1)

The join's left tree is parallel scan and the right tree is seq scan.
This algorithm is correct using the distribute distributive law of
distributed join:
       A = [A1 A2 A3...An], B then A join B = gather( (A1 join B) (A2 join B) 
... (An join B) )

The correctness of the above law should have a pre-assumption:
      The data set of B is the same in each join: (A1 join B) (A2 join B) ... 
(An join B)

But things get complicated when volatile functions come in. Timeofday is just
an example to show the idea. The core is volatile functions  can return 
different
results on successive calls with the same arguments. Thus the following piece,
the right tree of the join
                     ->  Seq Scan on t4
                           Filter: (timeofday() = c1)
can not be considered consistent everywhere in the scan workers.

The second plan

 Finalize Aggregate
   Output: count(*)
   ->  Gather
         Output: (PARTIAL count(*))
         Workers Planned: 2
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Nested Loop
                     Join Filter: (t3.c1 = t4.c1)
                     ->  Parallel Seq Scan on public.t3
                           Output: t3.c1, t3.c2
                           Filter: (t3.c1 ~~ '%sss'::text)
                     ->  Seq Scan on public.t4
                           Output: t4.c1, NULL::text, timeofday()
                           Filter: (timeofday() = t4.c1)

have voltile projections in the right tree of the nestloop:

                     ->  Seq Scan on public.t4
                           Output: t4.c1, NULL::text, timeofday()
                           Filter: (timeofday() = t4.c1)

It should not be taken as consistent in different workers.

------------------------------------------------------------------------------------------

The above are just two cases we find today. And it should be enough to
show the core issue to have a discussion here.

The question is, should we consider volatile functions when generating
parallel plans?

------------------------------------------------------------------------------------------
FYI, some plan diffs of Greenplum can be found here: 
https://www.diffnow.com/report/etulf

Volatile Functions in Parallel Plans

Reply via email to