On Fri, Feb 2, 2024 at 11:26 AM David Rowley <dgrowle...@gmail.com> wrote:
> In light of this, do you still think it's worthwhile making this change?
>
> For me, I think all it's going to result in is extra planner work
> without any performance gains.

Hmm, with the query below, I can see that the new plan is cheaper than
the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.

create table t (a int, b int);
insert into t select i%100000, i from generate_series(1,10000000)i;
analyze t;

-- on master
explain (costs on) select distinct a from t order by a limit 1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=120188.50..120188.51 rows=1 width=4)
   ->  Sort  (cost=120188.50..120436.95 rows=99379 width=4)
         Sort Key: a
         ->  HashAggregate  (cost=118697.82..119691.61 rows=99379 width=4)
               Group Key: a
               ->  Gather  (cost=97331.33..118200.92 rows=198758 width=4)
                     Workers Planned: 2
                     ->  HashAggregate  (cost=96331.33..97325.12 rows=99379 width=4)
                           Group Key: a
                           ->  Parallel Seq Scan on t  (cost=0.00..85914.67 rows=4166667 width=4)
(10 rows)

-- on patched
explain (costs on) select distinct a from t order by a limit 1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=106573.93..106574.17 rows=1 width=4)
   ->  Unique  (cost=106573.93..130260.88 rows=99379 width=4)
         ->  Gather Merge  (cost=106573.93..129763.98 rows=198758 width=4)
               Workers Planned: 2
               ->  Sort  (cost=105573.91..105822.35 rows=99379 width=4)
                     Sort Key: a
                     ->  HashAggregate  (cost=96331.33..97325.12 rows=99379 width=4)
                           Group Key: a
                           ->  Parallel Seq Scan on t  (cost=0.00..85914.67 rows=4166667 width=4)
(9 rows)

It seems that including a LIMIT clause can potentially favor the new plan.

Thanks
Richard
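P.S. A quick numeric sketch of the fuzz-factor claim above, using the two
Limit-node total costs from the plans (STD_FUZZ_FACTOR is 1.01 in
src/backend/optimizer/util/pathnode.c; the snippet is just illustrative
arithmetic, not planner code):

```python
# Compare the two Limit-node total costs against PostgreSQL's
# STD_FUZZ_FACTOR (1.01): one path is considered cheaper than another
# only if it wins by more than this margin.
STD_FUZZ_FACTOR = 1.01

master_total = 120188.51   # Limit total cost on master
patched_total = 106574.17  # Limit total cost with the patch

# The patched plan clears the fuzz margin comfortably.
assert patched_total * STD_FUZZ_FACTOR < master_total
print(master_total / patched_total)  # ratio well above 1.01
```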