I was looking at some stubborn queries in one of our applications, and not knowing the internals of the query planner, thought I might ask if this planner improvement is possible at all. We have queries with the general form of IN (SELECT FROM AA JOIN (SELECT foo UNION ALL SELECT bar)) clauses. (There's a view here, but that's not the planner's concern.) The corresponding part of the plan looks like this:

-> HashAggregate (cost=135960.26..135960.26 rows=200 width=4)
-> Nested Loop (cost=1.14..130852.49 rows=2043108 width=4)
-> Index Scan using AA_pkey on AA (cost=0.00..85.01 rows=1 width=12)
-> Subquery Scan BB (cost=1.14..110336.40 rows=2043108 width=36)
-> Append (cost=1.14..89905.32 rows=2043108 width=16)
-> Subquery Scan "*SELECT* 1" (cost=1.14..75017.28 rows=1634954 width=16)
-> Subquery Scan "*SELECT* 2" (cost=0.00..14888.04 rows=408154 width=16)


The problem for us is that the default estimate at the HashAggregate is absurdly low, undercutting the other available join candidates' row estimates resulting in _bad_ plans. What I was wondering is whether the planner has enough information available when looking at the HashAggregate step (the IN clause that started this whole mess) to consider the statistics of the inputs to the Append. In our case, I know that the second Subquery scan is generating strictly unique values, so the HashAggregate cannot produce fewer than 408154 rows itself. Additionally, those subquery scans are simple queries without anything to cause the table statistics to be thrown away if they can be reused in later steps (or is the rowcount the only stat used after a node is created?) -- one subquery is a single-table filter, the other is a two table join, also nothing special.

Essentially, does the data exist for the planner to say at the HashAggregate, "a-ha! Subquery 2 estimates 400000 distinct rows!" and estimate something >= 400000?

If this is improved in 8.0, my apologies; we haven't had the opportunity to begin a migration, and the release notes only say there are a lot of small improvements. If this is a special case of a long-standing limitation, I'm not seeing the general case this fits in to, and I'll happily sit down and be quiet. :-)

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
     joining column's datatypes do not match

Reply via email to