Underestimated number of output rows with an aggregate function

2023-10-15 Thread Philippe BEAUDOIN

Hi all,

Working on the emaj extension (for the curious ones, 
https://emaj.readthedocs.io/en/latest/ and 
https://github.com/dalibo/emaj), I recently faced a performance problem 
when querying and aggregating data changes. A query with 3 CTEs exhibits 
O(n^2) behavior (https://explain.dalibo.com/plan/1ded242d4ebf3gch#plan). 
I found a workaround by setting enable_nestloop to FALSE, but it has 
drawbacks, so I want to better understand the issue.
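
For reference, the workaround is simply the planner switch, sketched 
here at session scope:

  -- Disable nested loop joins so the planner falls back to hash or
  -- merge joins; reset as soon as the affected queries are done.
  SET enable_nestloop = FALSE;
  -- ... run the affected queries ...
  RESET enable_nestloop;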


During my analysis, I realized that the output rows estimate of the 
second CTE is really bad, leading to a bad plan for the next CTE.


I reproduced the issue in a very small test case with a simplified 
query. Attached is a shell script and its output.


A simple table is created, filled and analyzed.

The simplified statement is:
 WITH keys AS (
   SELECT c1, min(seq) AS seq FROM perf GROUP BY c1
   )
 SELECT tbl.*
   FROM perf tbl JOIN keys ON (keys.c1 = tbl.c1 AND keys.seq = tbl.seq);

Its plan is:
 Hash Join  (cost=958.00..1569.00 rows=1 width=262) (actual time=18.516..30.702 rows=10000 loops=1)
   Output: tbl.c1, tbl.seq, tbl.c2
   Inner Unique: true
   Hash Cond: ((tbl.c1 = perf.c1) AND (tbl.seq = (min(perf.seq))))
   Buffers: shared hit=856
   ->  Seq Scan on public.perf tbl  (cost=0.00..548.00 rows=12000 width=262) (actual time=0.007..2.323 rows=12000 loops=1)
         Output: tbl.c1, tbl.seq, tbl.c2
         Buffers: shared hit=428
   ->  Hash  (cost=808.00..808.00 rows=10000 width=8) (actual time=18.480..18.484 rows=10000 loops=1)
         Output: perf.c1, (min(perf.seq))
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         Buffers: shared hit=428
         ->  HashAggregate  (cost=608.00..708.00 rows=10000 width=8) (actual time=10.688..14.321 rows=10000 loops=1)
               Output: perf.c1, min(perf.seq)
               Group Key: perf.c1
               Batches: 1  Memory Usage: 1425kB
               Buffers: shared hit=428
               ->  Seq Scan on public.perf  (cost=0.00..548.00 rows=12000 width=8) (actual time=0.002..2.330 rows=12000 loops=1)
                     Output: perf.c1, perf.seq, perf.c2
                     Buffers: shared hit=428

Overall, the plan looks good to me, with 2 sequential scans and a hash join.
But the estimated number of returned rows for the join is always 1, while 
it actually depends on the data content (here, 10,000).


For the hash join node, the plan shows an "Inner Unique: true" property. 
I wonder if this is normal. It looks like the optimizer doesn't take 
into account the presence of the GROUP BY clause in its estimate.
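
For what it's worth, the per-column statistics themselves can be 
inspected with a query like this, and they look sane here:

  -- n_distinct is what drives the per-clause selectivity estimates
  SELECT attname, n_distinct
    FROM pg_stats
    WHERE schemaname = 'public' AND tablename = 'perf';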


I reproduced the case with all supported Postgres versions.

Thanks in advance for any explanation.
Philippe.


test_case.sh
select version();
                                                 version
--------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
(1 row)

\timing on
Timing is on.
\set ON_ERROR_STOP
SET client_min_messages TO WARNING;
SET
Time: 0,197 ms
--> create structures
create table perf (
  c1    integer    not null,
  seq   serial     not null,
  c2    text
);
CREATE TABLE
Time: 10,592 ms
--> populate the table
insert into perf (c1, c2) select i, rpad('2',300) from generate_series (1, 1000*:p_scaleFactor) i;
INSERT 0 10000
Time: 49,318 ms
--> perform changes
insert into perf (c1,c2) select c1, 'updated' from perf where c1 % 5 = 0;
INSERT 0 2000
Time: 34,606 ms
--> vacuum and analyze
select count(*) from perf;
 count 
-------
 12000
(1 row)

Time: 4,693 ms
vacuum analyze perf;
VACUUM
Time: 28,937 ms
--> bad estimate of the number of output rows (always 1)
explain (analyze, buffers, verbose)
 WITH keys AS (
   SELECT c1, min(seq) AS seq
     FROM perf
     GROUP BY c1
   )
 SELECT tbl.*
   FROM perf tbl
        JOIN keys ON (keys.c1 = tbl.c1 AND keys.seq = tbl.seq)
;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=958.00..1569.00 rows=1 width=262) (actual time=14.902..24.311 rows=10000 loops=1)
   Output: tbl.c1, tbl.seq, tbl.c2
   Inner Unique: true
   Hash Cond: ((tbl.c1 = perf.c1) AND (tbl.seq = (min(perf.seq))))
   Buffers: shared hit=856
   ->  Seq Scan on public.perf tbl  (cost=0.00..548.00 rows=12000 width=262) (actual time=0.004..1.813 rows=12000 loops=1)
         Output: tbl.c1, tbl.seq, tbl.c2
         Buffers: shared hit=428
   ->  Hash  (cost=808.00..808.00 rows=10000 width=8) (actual time=14.880..14.881 rows=10000 loops=1)
         Output: perf.c1, (min(perf.seq))
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         Buffers: shared hit=428
         ->  HashAggregate  (cost=608.00..708.00 ro

Re: Underestimated number of output rows with an aggregate function

2023-10-16 Thread Philippe BEAUDOIN

On 15/10/2023 at 18:37, Tom Lane wrote:

Philippe BEAUDOIN writes:

During my analysis, I realized that the output rows estimate of the
second CTE is really bad, leading to a bad plan for the next CTE.
I reproduced the issue in a very small test case with a simplified
query. Attached is a shell script and its output.

Yeah.  If you try it you'll see that the estimates for the
"keys.c1 = tbl.c1" and "keys.seq = tbl.seq" clauses are spot-on
individually.  The problem is that the planner assumes that they
are independent clauses, so it multiplies those selectivities together.
In reality, because seq is already unique, the condition on c1 adds
no additional selectivity.
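
Roughly, with the test case's numbers (a back-of-the-envelope sketch, 
not the planner's exact code path):

  sel(keys.c1 = tbl.c1)   ~ 1/10000   (10000 distinct c1 values)
  sel(keys.seq = tbl.seq) ~ 1/12000   (12000 distinct seq values)
  rows ~ 12000 * 10000 * (1/10000) * (1/12000) = 1

whereas the true output is 10000 rows, because the seq clause alone 
already pins down the matching row.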

If seq is guaranteed unique in your real application, you could just
drop the condition on c1.  Otherwise I'm not sure about a good
answer.  In principle creating extended stats on c1 and seq should
help, but I think we don't yet apply those for join clauses.
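
For the record, that would look something like this (the statistics 
name is arbitrary):

  -- Extended statistics on the column pair; note that join clauses
  -- currently don't consult these, so this doesn't fix the estimate.
  CREATE STATISTICS perf_c1_seq (ndistinct, dependencies)
    ON c1, seq FROM perf;
  ANALYZE perf;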

A partial answer could be to defeat application of the table's
statistics by writing

   JOIN keys ON (keys.c1 = tbl.c1+0 AND keys.seq = tbl.seq+0)

For me this gives an output estimate of 3000 rows, which is still not
great but should at least prevent choice of an insane plan at the
next join level.  However, it pessimizes the plan for this query
itself a little bit (about doubling the runtime).


Thanks for the trick (and the quick answer). In the test case, it 
indeed produces a pretty good plan.
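
That is, the test-case statement rewritten with the trick:

  WITH keys AS (
    SELECT c1, min(seq) AS seq FROM perf GROUP BY c1
    )
  SELECT tbl.*
    FROM perf tbl
         JOIN keys ON (keys.c1 = tbl.c1+0 AND keys.seq = tbl.seq+0);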


Unfortunately, as these statements are generated and depend on the base 
table structure, the issue remains for some of them (but not all). So, 
for the moment at least, I keep the previous workaround (disabling 
nested loops).
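
Something like this, scoped so the setting doesn't leak outside the 
transaction that runs a generated statement (a sketch, not the exact 
emaj code):

  BEGIN;
  -- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
  SET LOCAL enable_nestloop = off;
  -- <generated statement here>
  COMMIT;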



For the hash join node, the plan shows a "Inner Unique: true" property.
I wonder if this is normal.

Sure.  The output of the WITH is visibly unique on c1.

OK, I see.

regards, tom lane