One problem I've seen in multiple databases is a table that holds a mixture of data sets within it. E.g. a queue table where 99% of the entries are "done" but most queries are working with the 1% that are "new" or in other states. The statistics are often dominated by the "done" entries and give bad estimates for query planning when the query is actually looking at the other rows.
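For what it's worth, the bad-estimate case is easy to reproduce (a throwaway sketch; the table and values here are invented for illustration):

    -- A queue table where essentially every row is finished.
    CREATE TABLE queue (
        id      bigserial PRIMARY KEY,
        status  text NOT NULL
    );
    INSERT INTO queue (status)
        SELECT 'done' FROM generate_series(1, 1000000);
    ANALYZE queue;

    -- A small batch of new work arrives.
    INSERT INTO queue (status)
        SELECT 'pending' FROM generate_series(1, 1000);

    -- The statistics predate the insert, so the planner still
    -- believes 'pending' rows are essentially nonexistent:
    EXPLAIN SELECT * FROM queue WHERE status = 'pending';

The row estimate here will typically be a single row against an actual thousand, and it stays wrong until something triggers a fresh ANALYZE.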
We've always talked about this as a "skewed distribution" or "intercolumn correlation" problem, and we've developed some tools for dealing with those issues. But I've been thinking that's not the only problem with these cases. The problem I'm finding is that the distribution of these small subsets can swing quickly, and understanding intercolumn correlations, even if we could do it perfectly, would be no help at all. Consider a table with millions of rows that are "done" but none that are "pending". Inserting just a few hundred or thousand new pending rows makes any estimates based on the existing statistics entirely incorrect. Even perfect statistics capable of making perfect estimates at ANALYZE time would be entirely wrong once a few inserts of pending rows are done.

Worse, counters like n_dead_tup, n_mod_since_analyze, etc. are affected by the same problem. It's easy (at least on older versions; maybe Peter's work has fixed this for btree) to get severe index bloat because vacuum doesn't run for a long time relative to the size of the busy portion of the table.

I imagine that to really tackle this we should be doing something like noticing when inserts, updates, and deletes affect key values that are "rare" according to the statistics, and triggering autovacuum ANALYZE runs that use indexes to update the statistics for only the relevant key ranges. Obviously this could get complex quickly.

Perhaps it should be something users could declare: some kind of "partitioned statistics" where you declare a WHERE clause and we generate statistics for the subset of the table where that clause is true. Then we could fairly easily also count things like n_mod_since_analyze for that WHERE clause too.

And yes, partitioning the table could be a solution to this in some cases. I think there are reasons why it isn't always going to work for these issues though, not least that users will likely have other ways they already want to partition the data.
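To make the declaration idea concrete, here's roughly the shape I'm picturing. To be clear, this syntax is entirely invented; it borrows the form of the existing CREATE STATISTICS command, but the WHERE clause and the per-subset counters don't exist anywhere today:

    -- Hypothetical: build and maintain a separate set of per-column
    -- statistics covering only the rows matching the predicate, plus
    -- a separate n_mod_since_analyze-style counter for that subset so
    -- autovacuum can notice when the busy 1% has churned.
    CREATE STATISTICS queue_active_stats
        ON id, status FROM queue
        WHERE status <> 'done';

A query whose quals imply the declared predicate could then be estimated from the subset statistics instead of the whole-table ones.

-- greg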