Re: Use extended statistics to estimate (Var op Var) clauses

Tomas Vondra Wed, 11 Aug 2021 15:00:35 -0700


On 8/11/21 5:17 PM, Mark Dilger wrote:

On Aug 11, 2021, at 7:51 AM, Mark Dilger wrote:

I'll go test random data designed to have mcv lists of significance....

Done. The data for column_i is set to floor(random()^i*20). column_1
therefore is evenly distributed between 0..19, with successive columns
weighted more towards smaller values.
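
For readers who want to see the shape of such data without a database, here is a minimal Python sketch of the distribution described above. It is not the script used for the test: the function name, row count and seed are assumptions, and it assumes each column gets its own independent random() draw, as it would if random() were called once per column.

```python
import math
import random

def make_rows(n_rows, n_cols, seed=0):
    """Build rows where column_i = floor(random()^i * 20) for i = 1..n_cols.

    Hypothetical helper for illustration only; every column uses its own
    random draw, so the columns are independent of each other.
    """
    rng = random.Random(seed)
    return [
        [math.floor(rng.random() ** i * 20) for i in range(1, n_cols + 1)]
        for _ in range(n_rows)
    ]

rows = make_rows(10000, 3)
# Per-column averages: column_1 is roughly uniform over 0..19, while
# higher powers push later columns towards smaller values.
means = [sum(r[c] for r in rows) / len(rows) for c in range(3)]
print(means)
```

The skew matters because frequently repeated small values are the ones likely to end up in an MCV list.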

This still gives (marginally) worse results than the original test I
posted, but better than the completely random data from the last post.
After the patch, 72294 estimates got better and 30654 got worse.  The
biggest losers from this data set are:

better:0, worse:31:  A >= B or A = A or not A = A
better:0, worse:31:  A >= B or A = A
better:0, worse:31:  A >= B or not A <> A
better:0, worse:31:  A >= A or A = B or not B = A
better:0, worse:31:  A >= B and not A < A or A = A
better:0, worse:31:  A = A or not A > B or B <> A
better:0, worse:31:  A >= B or not A <> A or not A >= A
better:0, worse:32:  B < A and B > C and not C < B                    <----
better:1, worse:65:  A <> C and A <= B                                  <----
better:0, worse:33:  B <> A or B >= B
better:0, worse:33:  B <> A or B <= B
better:0, worse:33:  B <= A or B = B or not B > B
better:0, worse:33:  B <> A or not B >= B or not B < B
better:0, worse:33:  B = A or not B > B or B = B
better:0, worse:44:  A = B or not A > A or A = A
better:0, worse:44:  A <> B or A <= A
better:0, worse:44:  A <> B or not A >= A or not A < A
better:0, worse:44:  A <= B or A = A or not A > A
better:0, worse:44:  A <> B or A >= A

Of which, a few do not contain columns compared against themselves,
marked with <---- above.

I don't really know what to make of these results. It doesn't bother
me that any particular estimate gets worse after the patch.
That's just the nature of estimating.  But it does bother me a bit
that some types of estimates consistently get worse.  We should
either show that my analysis is wrong about that, or find a way to
address it to avoid performance regressions.  If I'm right that there
are whole classes of estimates that are made consistently worse, then
it stands to reason some users will have those data distributions and
queries, and could easily notice.

I'm not quite sure that's really a problem. Extended statistics are
meant for correlated columns, and it's mostly expected the estimates
may be a bit worse for random / independent data. The idea is mostly
that statistics will be created only for correlated columns, in which
case it should improve the estimates. I'd be way more concerned if you
observed consistently worse estimates on such a data set.
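
To make the independent vs. correlated distinction concrete, here is a rough Python sketch. It does not reproduce the planner's estimator or the patch. It only compares the true fraction of rows matching (A <= B) with what per-column frequencies alone would predict if the columns were independent. The function names and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter

def actual_sel(rows):
    """Fraction of rows where a <= b, computed from the joint data."""
    return sum(1 for a, b in rows if a <= b) / len(rows)

def independent_sel(rows):
    """The same fraction estimated from per-column frequencies only,
    i.e. assuming the two columns are independent."""
    n = len(rows)
    freq_a = Counter(a for a, _ in rows)
    freq_b = Counter(b for _, b in rows)
    return sum(
        (ca / n) * (cb / n)
        for a, ca in freq_a.items()
        for b, cb in freq_b.items()
        if a <= b
    )

rng = random.Random(0)
independent = [(rng.randrange(20), rng.randrange(20)) for _ in range(10000)]
# Correlated case: b is always a or a + 1, so a <= b holds on every row.
correlated = [(a, a + rng.randrange(2)) for a, _ in independent]

for name, rows in (("independent", independent), ("correlated", correlated)):
    print(name, round(actual_sel(rows), 3), round(independent_sel(rows), 3))
```

On the independent data the two numbers nearly coincide, so joint statistics have little to add and can only introduce noise. On the correlated data the per-column view misses that every row matches, which is the case extended statistics are meant to help with.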

Of course, there may be errors - the incorrect handling of (A op A) is
probably an example of such an issue.
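
As a sketch of why (A op A) is a special case: for the usual comparison operators the result does not depend on the data distribution at all, only on whether the value is NULL. The Python helper below is hypothetical, is not taken from the patch, and assumes ordinary btree-style operators.

```python
# Operators that are true for every non-NULL value compared with itself.
ALWAYS_TRUE = {"=", "<=", ">="}
# Operators that are false for every value compared with itself.
ALWAYS_FALSE = {"<>", "<", ">"}

def self_comparison_selectivity(op, null_frac=0.0):
    """Selectivity of (A op A) for a column with the given NULL fraction.

    Hypothetical helper: rows where A is NULL never satisfy the clause,
    and non-NULL rows satisfy it exactly when the operator is reflexive.
    """
    if op in ALWAYS_TRUE:
        return 1.0 - null_frac
    if op in ALWAYS_FALSE:
        return 0.0
    raise ValueError(f"unsupported operator: {op}")

print(self_comparison_selectivity("=", null_frac=0.1))
print(self_comparison_selectivity("<"))
```

Any estimate for such a clause that is derived from MCV lists or histograms can therefore only match or be worse than this trivial answer.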



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
