Re: [HACKERS] parallelize queries containing initplans

Kuntal Ghosh Thu, 16 Nov 2017 02:24:36 -0800

On Thu, Nov 16, 2017 at 3:34 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, Nov 15, 2017 at 12:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Nov 14, 2017 at 11:00 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Yeah, I'm sure it is.  I have a vague recollection that there might be
>>> existing regression test cases exercising such things, though I won't
>>> swear to that.  The "orderstest" bit in subselect.sql looks like it
>>> might be meant to do that...
>>
>
> I agree that such cases can cause a problem with fixed memory.
>
I'm able to come up with a test case to produce the problem.


regression[107494]=# select ten,count(*) from tenk1 a where a.ten in (select
b.ten from tenk1 b where (select a.ten from tenk1 c where c.ten =
a.ten limit 1) = b.ten limit 1) group by a.ten;

                    QUERY PLAN
---------------------------------------------------
HashAggregate
  Group Key: a.ten
  ->  Seq Scan on tenk1 a
        Filter: (SubPlan 2)
        SubPlan 2
          ->  Limit
                InitPlan 1 (returns $1)
                  ->  Limit
                        ->  Seq Scan on tenk1 c
                              Filter: (ten = a.ten)
                ->  Seq Scan on tenk1 b
                      Filter: ($1 = ten)
In the above case, Subplan 2 returns the same value of a.ten that is
used in initplan as a reference. So, this query is nothing but select
ten,count(*) from tenk1 group by a.ten. The output should be 10 rows
each row having a count=1000.

By enforcing parallelism, I got the following plan:
                      QUERY PLAN
------------------------------------------------------
HashAggregate
  Group Key: a.ten
  ->  Seq Scan on tenk1 a
        Filter: (SubPlan 2)
        SubPlan 2
          ->  Limit
                InitPlan 1 (returns $1)
                  ->  Limit
                        ->  Seq Scan on tenk1 c
                              Filter: (ten = a.ten)
                ->  Gather
                      Workers Planned: 2
                      Params Evaluated: $1
                      ->  Parallel Seq Scan on tenk1 b
                            Filter: ($1 = ten)

When I set parallel_leader_participation to off, I get the incorrect
result as follows:
 ten | count
------------
   0  | 1000
(1 row)
In the above case, the initplan gets evaluated only once from
ExecInitParallelPlan path. Hence, we see the incorrect result.

>> Here's an updated patch that attempts to work around this problem using DSA.
>>
>
> There were a couple of problems with your changes:
> 1.
> BufferUsage *buffer_usage; /* points to bufusage area in DSM */
> + dsa_handle param_exec; /* serialized PARAM_EXEC parameters */
> @@ -35,12 +36,13 @@ typedef struct ParallelExecutorInfo
>  } ParallelExecutorInfo;
>
> This should be dsa_pointer, otherwise, the value returned by
> SerializeParamExecParams will get truncated.
>
> 2. In ExecParallelReinitialize(), we need to evaluate the params
> before serializing them.
>
> 3. I think we should free the dsa pointer at the end of the gather.
>
> Attached patch fixes the mentioned problems.
>
>> It could use a test case that actually tickles the new logic in
>> ExecParallelReinitialize ... this is mostly just to show the concept.
>>
>
> Thanks, it was quite helpful.
>
I've tested the above-mentioned scenario with this patch and it is
working fine. Also, I've created a text column named 'vartext',
inserted some random length texts(max length 100) and tweaked the
above query as follows:
select ten,count(*) from tenk1 a where a.ten in (select
b.ten from tenk1 b where (select a.vartext from tenk1 c where c.ten =
a.ten limit 1) = b.vartext limit 1) group by a.ten;
This query is equivalent to select ten,count(*) from tenk1 group by
a.ten. It also produced the expected result without throwing any
error.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] parallelize queries containing initplans

Reply via email to