On 24/09/2024 08:08, Andrei Lepikhov wrote:
On 19/9/2024 09:55, Andrei Lepikhov wrote:
This wrong prediction makes things much worse if the query has
additional upper query blocks stacked on top.
His question was: Why not consider the grouping column unique in the
upper query block? It could improve the estimates.
After a thorough investigation, I discovered that in commit
4767bc8ff2 most of the work was already done for DISTINCT clauses. So,
why not do the same for grouping? A sketch of the patch is attached.
As I see it, grouping in this sense works quite similarly to DISTINCT,
and we have no reason to ignore it. After applying the patch, you can
see that the estimate has improved:
Hash Right Join (cost=5.62..162.56 rows=50 width=36)
A regression test is added in the new version.
The code is tiny, simple, and non-invasive; it will be easy to commit
or reject. So I have added it to the next commitfest.
Looks good at a quick glance.
@@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	}
 
 	/*
-	 * If there is a unique index or DISTINCT clause for the variable, assume
-	 * it is unique no matter what pg_statistic says; the statistics could be
-	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query. However, we'd better still believe
-	 * the null-fraction statistic.
+	 * If there is a unique index, DISTINCT or GROUP-BY clause for the
+	 * variable, assume it is unique no matter what pg_statistic says; the
+	 * statistics could be out of date, or we might have found a partial
+	 * unique index that proves the var is unique for this query. However,
+	 * we'd better still believe the null-fraction statistic.
 	 */
 	if (vardata->isunique)
 		stadistinct = -1.0 * (1.0 - stanullfrac);
I wonder about the "we'd better still believe the null-fraction
statistic" part. It makes sense for a unique index, but a DISTINCT or
GROUP BY collapses all the NULLs to a single row. So I think there's
some more work to be done here.
--
Heikki Linnakangas
Neon (https://neon.tech)