Hello,

Here is a patch to improve rowcount estimates for `UNNEST(some_array_column)`. Today we hard-code this to 10, but we have statistics about array size, so it's easy to use them.
I've seen plans where this would make a difference. If the array has only 1 or 2 elements, then overestimating the rowcount by 10 leads to unnecessary seqscans downstream. I can see how an underestimate would cause issues too.

This patch builds on commit a391ff3c3d41, which allowed set-returning functions like UNNEST to include a support function to estimate their result count. (There is a nice writeup at https://www.cybertec-postgresql.com/en/optimizer-support-functions/) But that commit only changes UNNEST when it has a Const or ArrayExpr argument.

The statistic I'm using is the last value in the DECHIST array, which is the average number of distinct elements in the array. Using the plain (non-distinct) element count would be more accurate, but we don't have that, and using distinct elements is still better than a hard-coded 10.

The real change is in estimate_array_length, which has several callers besides array_unnest_support, but I think this change should give more accurate estimates for all of them. There is a comment that estimate_array_length must agree with scalararraysel, but I don't think this commit introduces any discrepancies: the most relevant case there is `scalar = ANY/ALL (array)`, which also consults DECHIST (and/or MCELEM).

I wasn't sure where to put a test. I finally settled on arrays.sql, since (1) it has other UNNEST tests, (2) array_unnest_support lives in utils/adt/arrayfuncs.c, and (3) I couldn't find a place devoted to rowcount/selectivity estimates. I'm happy to move it if someone has a better idea!

Based on 712dc2338b23.

Yours,

--
Paul ~{:-)
p...@illuminatedcomputing.com
v1-0001-Use-statitics-for-estimating-UNNEST-column-rows.patch
Description: Binary data