On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 1. ANALYZE happens far too infrequently to believe that any data taken
> at ANALYZE time will still be relevant at execution time.
> 2. Using data gathered by ANALYZE will make plans less stable, and our
> users complain not infrequently about the plan instability we already
> have, therefore we should not add more.
> 3. Even if the data were accurate and did not cause plan instability, we
> have no evidence that using it will improve real-world performance.
I feel like this is all baseless FUD. ANALYZE isn't perfect, but it's our interface for telling Postgres to gather stats, and we generally agree that having stats and modelling the system behaviour as accurately as practical is the right direction. So we need a specific reason why this stat and this bit of modelling is a bad idea before we dismiss it.

I think the kernel of truth in these concerns is simply that everything else ANALYZE looks at mutates only on DML. If you load the same data into two databases and run ANALYZE you'll get (modulo random sampling) the same stats. And if you never modify the data and analyze it again a week later you'll get the same stats again. So autovacuum can guess when to run ANALYZE based on the number of DML operations, it can run it without regard to how busy the system is, and it can hold off on running it if the data hasn't changed.

In the case of the filesystem buffer cache, the cached percentage will vary over time regardless of whether the data changes. Plain select queries will change it, and even activity outside the database will change it. There are a bunch of strategies for mitigating this problem: we might want to look at the cache situation more frequently, discount the results we see more aggressively, and possibly maintain some kind of running average over time.

There's another problem which I haven't seen mentioned: because the access method will affect the cache, there's the possibility of feedback loops. E.g. a freshly loaded system prefers sequential scans for a given table because without the cache the seeks of random reads are too expensive... causing it never to read that table into cache... causing that table to stay uncached and the planner to never switch to an index method. It's possible there are mitigation strategies for this as well, such as keeping a running average over time and discounting the estimates with some heuristic values; a toy sketch of what I mean is below my sig.

-- 
greg
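P.S. To make the "running average plus discounting" idea concrete, here's a toy sketch in C. To be clear, none of this is real planner code: the function names, the constants (ALPHA, PRIOR_FRAC, PRIOR_WEIGHT) and the cost interpolation are all invented for illustration. The idea is just to smooth the observed cached fraction with an exponential moving average, pull it back toward a neutral prior so a cold table isn't permanently locked out of index plans, and then interpolate the per-page cost between the cached and uncached cases.

#include <stdio.h>

/* All constants are made-up heuristic values, purely for illustration. */
#define ALPHA        0.2	/* weight given to the newest observation */
#define PRIOR_FRAC   0.5	/* neutral prior cached fraction */
#define PRIOR_WEIGHT 0.3	/* how hard we pull back toward the prior */

/* Exponential moving average of the observed cached fraction. */
static double
update_cached_frac(double running_avg, double observed)
{
	return ALPHA * observed + (1.0 - ALPHA) * running_avg;
}

/*
 * Discount the running average toward a fixed prior so that a table which
 * merely happens to be uncached right now doesn't get locked into
 * seq-scan-only plans forever (the feedback loop above).
 */
static double
discounted_cached_frac(double running_avg)
{
	return PRIOR_WEIGHT * PRIOR_FRAC + (1.0 - PRIOR_WEIGHT) * running_avg;
}

/* Interpolate the per-page cost between cached and uncached access. */
static double
page_cost(double cached_frac, double cached_page_cost, double random_page_cost)
{
	return cached_frac * cached_page_cost + (1.0 - cached_frac) * random_page_cost;
}

int
main(void)
{
	double	avg = PRIOR_FRAC;	/* start the average at the prior */
	double	samples[] = {0.05, 0.10, 0.08, 0.60, 0.70};
	int		i;

	/* Feed in a series of observed cached fractions for one table. */
	for (i = 0; i < 5; i++)
	{
		double	frac;

		avg = update_cached_frac(avg, samples[i]);
		frac = discounted_cached_frac(avg);
		printf("obs=%.2f avg=%.2f discounted=%.2f page cost=%.2f\n",
			   samples[i], avg, frac, page_cost(frac, 1.0, 4.0));
	}
	return 0;
}

The only point of the prior blend is that the estimate never collapses all the way to "this table is never cached", so the feedback loop can't lock us in permanently. What the right constants are, and whether a plain moving average is even the right smoothing, would obviously need experimentation.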