On 29.10.2024 05:47, Andrei Lepikhov wrote:
On 10/28/24 16:48, Alena Rybakina wrote:
On 23.10.2024 04:39, Andrei Lepikhov wrote:
On 15/10/2024 12:15, David Rowley wrote:
And the last patch is a demo of how I'm going to use the previous
three patches and add one more strategy to improve the order of
columns in the GROUP-BY clause.
To be honest, I didn't find any information about this in the code. Did
I understand everything correctly?
Yes
Thanks for the confirmation.
2. I noticed that statistics of distinct values are calculated several
times. Maybe I missed something, but this could be avoided if these
statistics were saved after being calculated. For example, I saw that
the same statistics may be calculated for the same equivalence members
in both cost_incremental_sort and identify_sort_ecmember. Is it
possible to store this information in a structure and reuse it later?
Hmm, I don't see multiple calculations. em_ndistinct was introduced
specifically for this purpose. Can you provide a specific case and the
code lines?
When you said that, I saw it right away. You are right: the check for 0
is what stops us from a useless recalculation.
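Just to make sure I read it correctly, the pattern is roughly this (my
own sketch, using em_ndistinct from the patch; that the estimate comes
from get_variable_numdistinct is my assumption):

    /*
     * Sketch: em_ndistinct == 0 acts as "not computed yet", so the
     * statistics lookup runs at most once per equivalence member.
     */
    if (em->em_ndistinct == 0.0)
    {
        VariableStatData vardata;
        bool             isdefault;

        examine_variable(root, (Node *) em->em_expr, 0, &vardata);
        if (HeapTupleIsValid(vardata.statsTuple))
            em->em_ndistinct = get_variable_numdistinct(&vardata,
                                                        &isdefault);
        ReleaseVariableStats(vardata);
    }
    /* later callers reuse em->em_ndistinct without recalculating */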
3. I think you should initialize the variable ndist in your patch. I
ran into a compiler complaint while building your code:
costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized
[-Werror=maybe-uninitialized]
6694 | em->em_ndistinct = ndist;
| ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
6680 | double ndist;
| ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1
I think you can just update your compiler. But I added the ndist
initialisation to make more compilers happy :).
Thank you :)
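(For the record, as I understand it the change just gives ndist a
defined starting value so that every path to the assignment is
initialized, something like

    double      ndist = 0.0;    /* overwritten once an estimate is found */

where the particular fallback value is an assumption on my part.)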
+ Assert (node != NULL);
+
examine_variable(root, node, 0, &vardata);
if (!HeapTupleIsValid(vardata.statsTuple))
continue;
I don't think so. At least not until you provide a case where the
get_sortgroupclause_expr function returns NULL.
What's more, remember that patch 0004 here is just to show the
direction and is still under construction.
Anyway, thanks; I found out that the patch set doesn't apply cleanly
because of 828e94c. So, see the new version in the attachment.
Yes, you are right; if this case is possible, it would point to an
error in the construction of the target list.
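For clarity, this is my mental model of the hunk above (a sketch with
my own variable names, not the actual patch code):

    /*
     * Sketch: the expression comes straight from the target list, so
     * a NULL here would mean the target list itself is malformed.
     * An Assert documents that assumption; missing statistics, by
     * contrast, are expected and handled by the check below.
     */
    Node   *node = get_sortgroupclause_expr(sgc, targetlist);

    Assert(node != NULL);

    examine_variable(root, node, 0, &vardata);
    if (!HeapTupleIsValid(vardata.statsTuple))
        continue;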
I played around with the examples a bit and ran into something I can't
figure out. When I added the same values to different columns (first in
a, later in b), the order of the columns for the sort operation didn't
change. Isn't this a mistake?
create table a (x1 int, y1 int);
create table b (x2 int, y2 int);
insert into a values (NULL, NULL);
insert into a values (NULL, 1);
insert into a values (1, 1);
insert into a values (1, NULL);
create index a_x1_idx on a(x1);
create index b_x2_idx on b(x2);
create index a_y1_idx on a(y1);
create index b_y2_idx on b(y2);
insert into b select 1, 2 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 1, 3 from generate_series(3,30) as id;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1)
   Hash Cond: (b.x2 = a.x1)
   ->  HashAggregate  (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1)
         Group Key: b.x2, b.y2
         Batches: 1  Memory Usage: 40kB
         ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on a  (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1)
 Planning Time: 1.649 ms
 Execution Time: 0.485 ms
(11 rows)
delete from b;
insert into b select 2, 1 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 3, 1 from generate_series(3,30) as id;
vacuum analyze;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1)
   Hash Cond: (a.x1 = b.x2)
   ->  Seq Scan on a  (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1)
   ->  Hash  (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  HashAggregate  (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1)
               Group Key: b.x2, b.y2
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on b  (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1)
 Planning Time: 0.391 ms
 Execution Time: 0.151 ms
(11 rows)
I also noticed a misspelling, "significance phusics", in this comment:
/*
 * Calculate multiplier reflecting the number of comparisons which executor
 * have to perform during the sort with this specific order of columns.
 *
 * The comparison factor f = 1.+F(pathkeys). There 1. incapsulates the
 * second-order of *significance phusics* which cost function doesn't consider.
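As I read that comment, the idea seems to be that a trailing column is
only compared when all preceding columns tie, so F(pathkeys)
accumulates those conditional comparisons. A rough sketch of my
understanding (my own code and naming, not the patch's):

    #include <math.h>

    /*
     * Sketch only: per-tuple comparison multiplier for a given column
     * order. Column i is compared only when columns 0..i-1 all tie,
     * which happens with probability about 1 / product of their
     * ndistinct values.
     */
    static double
    comparison_factor(const double *ndistinct, int nkeys)
    {
        double      f = 1.0;        /* the leading column is always compared */
        double      tie_prob = 1.0; /* P(all preceding columns tie) */

        for (int i = 1; i < nkeys; i++)
        {
            tie_prob /= fmax(ndistinct[i - 1], 1.0);
            f += tie_prob;          /* column i costs a comparison on a tie */
        }
        return f;
    }

Under that reading, placing the column with the most distinct values
first minimises f, which I assume is the effect the new GROUP-BY
ordering strategy in 0004 is after.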
--
Regards,
Alena Rybakina
Postgres Professional