Hi hackers,

It seems that the function `get_variable_numdistinct` ignores the case where stanullfrac is 1.0:
# create table t(a int, b int);
CREATE TABLE
# insert into t select i from generate_series(1, 10000)i;
INSERT 0 10000
gpadmin=# analyze t;
ANALYZE
# explain analyze select b, count(1) from t group by b;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..197.00 rows=200 width=12) (actual time=5.928..5.930 rows=1 loops=1)
   Group Key: b
   Batches: 1  Memory Usage: 40kB
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.018..1.747 rows=10000 loops=1)
 Planning Time: 0.237 ms
 Execution Time: 5.983 ms
(6 rows)

The insert only populates column a, so column b is entirely NULL. Even so, the planner falls back to the default estimate of 200 groups.

I have added a few lines of code to take `stanullfrac == 1.0` into account. With the attached patch, we now get:

# explain analyze select b, count(1) from t group by b;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..195.01 rows=1 width=12) (actual time=6.163..6.164 rows=1 loops=1)
   Group Key: b
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.024..1.823 rows=10000 loops=1)
 Planning Time: 0.535 ms
 Execution Time: 6.344 ms
(6 rows)

I am not sure how valuable this change is in practice, but it should be a step in the right direction.

Any comments on this are appreciated.
0001-Consider-the-case-when-stanullfrac-is-1.0-in-get_var.patch