On 29.10.2024 05:47, Andrei Lepikhov wrote:
On 10/28/24 16:48, Alena Rybakina wrote:
On 23.10.2024 04:39, Andrei Lepikhov wrote:
On 15/10/2024 12:15, David Rowley wrote:
And the last patch is a demo of how I'm going to use the previous
three patches and add one more strategy to improve the order of
columns in the GROUP-BY clause.
To be honest, I didn't find any information about this in the code. Did
I understand everything correctly?
Yes
Thanks for the confirmation.
2. I noticed that statistics of distinct values are calculated several
times. Maybe I missed something, but this could be avoided if these
statistics were saved after being calculated. For example, I saw that
the same statistics may be calculated for the same equivalence members
in both cost_incremental_sort and identify_sort_ecmember. Is it
possible to store this information in a structure and reuse it later?
Hmm, I don't see multiple calculations. em_ndistinct was introduced
specifically for this purpose. Can you provide a specific case and the
code lines?
When you said that, I saw it right away. You are right: the check for 0
is what stops us from a useless recalculation.
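Just to make sure I read it correctly, the pattern is roughly this (my
own sketch, using em_ndistinct from the patch; that the estimate comes
from get_variable_numdistinct is my assumption):

    /*
     * Sketch: em_ndistinct == 0 acts as "not computed yet", so the
     * statistics lookup runs at most once per equivalence member.
     */
    if (em->em_ndistinct == 0.0)
    {
        VariableStatData vardata;
        bool             isdefault;

        examine_variable(root, (Node *) em->em_expr, 0, &vardata);
        if (HeapTupleIsValid(vardata.statsTuple))
            em->em_ndistinct = get_variable_numdistinct(&vardata,
                                                        &isdefault);
        ReleaseVariableStats(vardata);
    }
    /* later callers reuse em->em_ndistinct without recalculating */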
3. I think you should initialize the variable ndist in your patch. I
ran into a compiler complaint while building your code:
costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized
[-Werror=maybe-uninitialized]
6694 | em->em_ndistinct = ndist;
| ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
6680 | double ndist;
| ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1
I think you can just update your compiler. But I added the ndist
initialisation to make more compilers happy :).
Thank you :)
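(For the record, as I understand it the change just gives ndist a
defined starting value so that every path to the assignment is
initialized, something like

    double      ndist = 0.0;    /* overwritten once an estimate is found */

where the particular fallback value is an assumption on my part.)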
+ Assert (node != NULL);
+
examine_variable(root, node, 0, &vardata);
if (!HeapTupleIsValid(vardata.statsTuple))
continue;
I don't think so. At least not until you provide a case where the
get_sortgroupclause_expr function returns NULL.
What's more, remember that patch 0004 here is just to show the
direction and is still under construction.
Anyway, thanks; I found out that the patch set doesn't apply cleanly
because of 828e94c. So, see the new version in the attachment.
Yes, you are right; if this case is possible, it would point to an
error in the construction of the target list.
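For clarity, this is my mental model of the hunk above (a sketch with
my own variable names, not the actual patch code):

    /*
     * Sketch: the expression comes straight from the target list, so
     * a NULL here would mean the target list itself is malformed.
     * An Assert documents that assumption; missing statistics, by
     * contrast, are expected and handled by the check below.
     */
    Node   *node = get_sortgroupclause_expr(sgc, targetlist);

    Assert(node != NULL);

    examine_variable(root, node, 0, &vardata);
    if (!HeapTupleIsValid(vardata.statsTuple))
        continue;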
I played around with the examples a bit and ran into something I can't
figure out. When I added the same values to different columns (first in
a, later in b), the order of the columns for the sort operation didn't
change. Isn't this a mistake?
create table a (x1 int, y1 int);
create table b (x2 int, y2 int);
insert into a values (NULL, NULL);
insert into a values (NULL, 1);
insert into a values (1, 1);
insert into a values (1, NULL);
create index a_x1_idx on a(x1);
create index b_x2_idx on b(x2);
create index a_y1_idx on a(y1);
create index b_y2_idx on b(y2);
insert into b select 1, 2 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 1, 3 from generate_series(3,30) as id;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1)
   Hash Cond: (b.x2 = a.x1)
   ->  HashAggregate  (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1)
         Group Key: b.x2, b.y2
         Batches: 1  Memory Usage: 40kB
         ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on a  (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1)
 Planning Time: 1.649 ms
 Execution Time: 0.485 ms
(11 rows)
delete from b;
insert into b select 2, 1 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 3, 1 from generate_series(3,30) as id;
vacuum analyze;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1)
   Hash Cond: (a.x1 = b.x2)
   ->  Seq Scan on a  (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1)
   ->  Hash  (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  HashAggregate  (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1)
               Group Key: b.x2, b.y2
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on b  (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1)
 Planning Time: 0.391 ms
 Execution Time: 0.151 ms
(11 rows)
I also noticed a misspelling, "significance phusics", in this comment:
/*
 * Calculate multiplier reflecting the number of comparisons which executor
 * have to perform during the sort with this specific order of columns.
 *
 * The comparison factor f = 1.+F(pathkeys). There 1. incapsulates the
 * second-order of *significance phusics* which cost function doesn't consider.
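As I read that comment, the idea seems to be that a trailing column is
only compared when all preceding columns tie, so F(pathkeys)
accumulates those conditional comparisons. A rough sketch of my
understanding (my own code and naming, not the patch's):

    #include <math.h>

    /*
     * Sketch only: per-tuple comparison multiplier for a given column
     * order. Column i is compared only when columns 0..i-1 all tie,
     * which happens with probability about 1 / product of their
     * ndistinct values.
     */
    static double
    comparison_factor(const double *ndistinct, int nkeys)
    {
        double      f = 1.0;        /* the leading column is always compared */
        double      tie_prob = 1.0; /* P(all preceding columns tie) */

        for (int i = 1; i < nkeys; i++)
        {
            tie_prob /= fmax(ndistinct[i - 1], 1.0);
            f += tie_prob;          /* column i costs a comparison on a tie */
        }
        return f;
    }

Under that reading, placing the column with the most distinct values
first minimises f, which I assume is the effect the new GROUP-BY
ordering strategy in 0004 is after.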
--
Regards,
Alena Rybakina
Postgres Professional