Hi!

On 27.01.2025 11:05, Richard Guo wrote:
> (This was briefly discussed in [1], which primarily focuses on the
> incremental sort regression.  So start a new thread for this topic.)
>
> In set_append_rel_size(), we currently set rel->tuples to rel->rows
> for an appendrel.  Generally, rel->tuples is the raw number of tuples
> in the relation and rel->rows is the estimated number of tuples after
> the relation's restriction clauses have been applied.  Although an
> appendrel itself doesn't directly enforce any quals today, its child
> relations may.  Therefore, setting rel->tuples equal to rel->rows for
> an appendrel isn't always appropriate.
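To make the tuples/rows distinction concrete, here is a toy numeric sketch (hypothetical child statistics, not the actual planner code) of the information that gets lost when an appendrel's tuples is simply copied from its rows:

```python
# Toy illustration with made-up numbers: an appendrel whose two
# children each carry a qual that filters out 90% of their tuples.
# "tuples" is the raw count, "rows" the post-restriction estimate.
children = [
    {"tuples": 500_000, "rows": 50_000},
    {"tuples": 500_000, "rows": 50_000},
]

appendrel_rows = sum(c["rows"] for c in children)        # 100000

# Current behavior: rel->tuples is set to rel->rows, so the
# appendrel looks completely unfiltered (tuples == rows).
tuples_before = appendrel_rows                           # 100000

# Patched behavior: accumulate the raw tuple counts of the live
# children, so the 10% restriction selectivity remains visible.
tuples_after = sum(c["tuples"] for c in children)        # 1000000

print(tuples_before, tuples_after)                       # 100000 1000000
```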

> AFAICS, doing so can lead to issues in cost estimates.  For instance,
> when estimating the number of distinct values from an appendrel, we
> would not be able to adjust the estimate based on the restriction
> selectivity (see estimate_num_groups()).
>
> Attached is a patch that sets an appendrel's tuples to the total
> number of tuples accumulated from each live child, which I believe
> aligns better with reality.
>
> Here's a simple example that demonstrates how this change improves
> cost estimates in certain cases.
>
> create table p (a int, b int, c float) partition by range(a);
> create table p1 partition of p for values from (0) to (1000);
> create table p2 partition of p for values from (1000) to (2000);
>
> insert into p select i%2000, random(1, 100000), random(1, 100000) from
> generate_series(1, 1000000)i;
>
> analyze p;
>
> explain analyze select b from p where c < 10000 group by b;
>
> -- without this patch
>   HashAggregate  (cost=18651.38..19568.54 rows=91716 width=4)
>                  (actual time=467.859..487.227 rows=63346 loops=1)
>
> -- with this patch
>   HashAggregate  (cost=18651.38..19275.60 rows=62422 width=4)
>                  (actual time=447.383..466.351 rows=63346 loops=1)
I looked at it and agree with your solution.
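To see the mechanism, here is a rough Python sketch of the restriction-selectivity adjustment in estimate_num_groups().  The exact code in selfuncs.c may differ in details, but to my understanding the correction has this "probability that at least one duplicate of each value survives the filter" shape, and it roughly reproduces the shift from rows=91716 to rows=62422 in the plans quoted above:

```python
def adjusted_ndistinct(ndistinct, tuples, rows):
    """Rough sketch of the adjustment in estimate_num_groups():
      ndistinct - distinct values of the grouping column
      tuples    - rel->tuples (raw tuple count)
      rows      - rel->rows (tuples surviving the quals)
    Scale ndistinct by the chance that at least one duplicate of
    each value survives:
      ndistinct * (1 - ((tuples - rows) / tuples) ** (tuples / ndistinct))
    """
    if tuples <= 0 or rows >= tuples:
        return ndistinct  # no visible filtering -> no adjustment
    sel = (tuples - rows) / tuples
    return ndistinct * (1.0 - sel ** (tuples / ndistinct))

# Without the patch the appendrel reports tuples == rows (100000),
# so the adjustment never fires and ndistinct stays put:
print(round(adjusted_ndistinct(100_000, 100_000, 100_000)))    # 100000

# With the patch, tuples is the children's total (1000000), and
# the ~10%-selective qual pulls the group estimate down:
print(round(adjusted_ndistinct(100_000, 1_000_000, 100_000)))  # 65132
```

With tuples == rows the `rows >= tuples` guard always takes the early exit, which is exactly why the pre-patch appendrel could never benefit from this correction.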

> Unfortunately, I failed to come up with a stable test case that shows
> a plan diff with this change.  So the attached patch does not include
> a test case for now.

I created a stable test:

create table p (a int, b int, c float) partition by range(a);
create table p1 partition of p for values from (0) to (100);
create table p2 partition of p for values from (100) to (1000);
insert into p select i%200, i%300, i%400 from
generate_series(1, 1000)i;
analyze p;
SELECT * FROM check_estimated_rows('select b from p where c < 10 group by b');
 estimated | actual
-----------+--------
        27 |     29
(1 row)
drop table p;
I added it in the diff file.

--
Regards,
Alena Rybakina
Postgres Professional
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index f0707e7f7ea..f01d0d41bfd 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4468,4 +4468,17 @@ explain (costs off) select * from hp_contradict_test where a === 1 and b === 1 a
 drop table hp_contradict_test;
 drop operator class part_test_int4_ops2 using hash;
 drop operator ===(int4, int4);
+create table p (a int, b int, c float) partition by range(a);
+create table p1 partition of p for values from (0) to (100);
+create table p2 partition of p for values from (100) to (1000);
+insert into p select i%200, i%300, i%400 from
+generate_series(1, 1000)i;
+analyze p;
+SELECT * FROM check_estimated_rows('select b from p where c < 10 group by b');
+ estimated | actual 
+-----------+--------
+        27 |     29
+(1 row)
+
+drop table p;
 drop function explain_analyze(text);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index ea9a4fe4a23..0e7a775e469 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -1353,4 +1353,15 @@ drop table hp_contradict_test;
 drop operator class part_test_int4_ops2 using hash;
 drop operator ===(int4, int4);
 
+create table p (a int, b int, c float) partition by range(a);
+create table p1 partition of p for values from (0) to (100);
+create table p2 partition of p for values from (100) to (1000);
+
+insert into p select i%200, i%300, i%400 from
+generate_series(1, 1000)i;
+
+analyze p;
+SELECT * FROM check_estimated_rows('select b from p where c < 10 group by b');
+
+drop table p;
 drop function explain_analyze(text);
