On Thu, Aug 6, 2020 at 5:35 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > It's not clear to me that it would even be correct to categorize those > > somewhat-different results as "less accurate." > > Estimating two rows where the correct answer is one row is clearly > "less accurate" [ than estimating one row ].
That's a tautology, so I can't argue with it as far as it goes. Thinking about it more, there are really two ways to think about an estimated row count. On the one hand, if you think of the row count estimate as the number of rows that are going to pop out of a node, then it's always right to think of a unique index as limiting the number of occurrences of a given value to 1. But, if you think of the row count estimate as a way of estimating the amount of work that the node has to do to produce that output, then it isn't. If a table has a lot of inserts and deletes, or a lot of updates, index scans might have to do a lot of extra work chasing down index pointers to tuples that end up being invisible to our scan. The scan may not have any filter quals at all, and even if it does, they are likely cheap to evaluate compared to the cost of finding a locking buffers and checking visibility, so the dominant cost of the scan is really based on the total number of rows that are present, not the number that are visible. Ideally, the presence of those rows would affect the cost estimate for the node in a way very similar to expecting to find more rows. At the same time, it doesn't work to just bump up the row count estimate for the node, because then you'll think more rows will be output, which might cause poor planning decisions at higher levels. It doesn't seem easy to get this 100% right. Tuple visibility can change very quickly, much faster than the inter-ANALYZE interval. And sometimes tuples can be pruned away very quickly, too, and the index pointers may be opportunistically removed very quickly, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company