On 25/04/2024 at 22:21, Robert Haas wrote:
The analyze case, I feel, is really murky. autovacuum_analyze_scale_factor stands for the proposition that as the table becomes larger, analyze doesn't need to be done as often. If what you're concerned about is the frequency estimates, that's true: an injection of a million new rows can shift frequencies dramatically in a small table, but the effect is blunted in a large one. But a lot of the cases I've seen have involved the histogram boundaries. If you're inserting data into a table in increasing order, every new million rows shifts the boundary of the last histogram bucket by the same amount. You either need those rows included in the histogram to get good query plans, or you don't. If you do, the frequency with which you need to analyze does not change as the table grows. If you don't, then it probably does. But the answer doesn't really depend on how big the table is already, but on your workload. So it's unclear to me that the proposed parameter is the right idea here at all. It's also unclear to me that the existing system is the right idea. 🙂
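(For reference, autoanalyze fires once the number of rows modified since the last analyze exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples, i.e. 50 + 0.10 * reltuples with the defaults: roughly 100k changed rows for a 1M-row table, but roughly 10M for a 100M-row table, even though an append-only workload may have pushed the histogram boundary just as far in both cases.)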
This is very interesting. And what about ndistinct? I believe it could be problematic, too, in some (admittedly rare or pathological) cases.
For example, suppose that the actual number of distinct values for a particular column grows from 1000 to 200000 after a batch of insertions. OK, in such a case, the default analyze sampling isn't large enough to compute an ndistinct close enough to reality anyway. But without any analyze at all, it can lead to very bad planning - think of a Nested Loop with a parallel seq scan on the outer table instead of a simple, efficient index scan, because the index scan of the inner table is overestimated (both each index scan's cost and the number of rows returned).
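To make that concrete, here is a minimal sketch of the scenario (the table, column names and row counts are made up; the exact estimates will vary with the statistics target and with whether autovacuum has managed to analyze the table in the meantime). The same per-value estimate is what drives the cost and row count of the inner index scan in the join case:

CREATE TABLE inner_tbl (grp int, payload text);
CREATE INDEX ON inner_tbl (grp);

-- 1000 distinct values in grp, 100 rows each; the statistics reflect this state.
INSERT INTO inner_tbl
SELECT g % 1000, repeat('x', 10) FROM generate_series(1, 100000) g;
ANALYZE inner_tbl;

-- Batch insert raises the distinct count to ~200000, with no new analyze.
INSERT INTO inner_tbl
SELECT g % 200000, repeat('x', 10) FROM generate_series(1, 2000000) g;

-- The planner scales the stale reltuples up to the table's current size,
-- but pg_stats.n_distinct is still 1000, so the equality estimate for a
-- value comes out around current_rows / 1000 (a few thousand rows)
-- instead of the actual ~10.
EXPLAIN SELECT * FROM inner_tbl WHERE grp = 42;

Running the EXPLAIN with ANALYZE shows the gap between estimated and actual rows, which is what tips the planner away from the plain index-scan plan in the real join.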