On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> Yes, I was aware of this problem (amount of memory consumed with lots
> of updates), and I kind of hoped someone will come up with a
> reasonable solution.
As far as I can see, periodically sampling some or all of the table is
really the only thing that has a chance of working OK.  You could try
to track changes only when they're not too big, but I think that's
still going to have nasty performance consequences.

> I was thinking about it and I believe at least one of the algorithms
> (the "adaptive sampling") might handle "merging", i.e. the backends
> would not need to store the list of values, just a private copy of
> the estimator and update it.  And at the end (after commit), the
> estimator would be merged back into the actual one.
>
> So the algorithm would be something like this:
>
>   1. create copy for all distinct estimators influenced by the INSERT
>   2. update the local copy
>   3. commit and merge the local copies back into the original estimator

Well, maybe.  But DELETEs seem like a problem.  And it's still adding
foreground work to every query, which I just have a hard time
believing is ever going to work out.  (A sketch of what such a
mergeable estimator might look like appears at the end of this
message.)

> Regarding the crash scenario - if the commit fails, just throw away
> the local estimator copy, it's not needed.  I'm not sure how to take
> care of the case when commit succeeds and the write of the merged
> estimator fails, but I think it might be possible to write the
> estimator to xlog or something like that.  So it would be replayed
> during recovery etc.  Or is it a stupid idea?

It's not stupid, in the sense that that is what you'd need to do if
you want to avoid ever having to rescan the table.  But it is another
thing that I think is going to be way too expensive.

> Yes, as I've mentioned above, the multi-column stats are the reason
> why I started working on this.  And yes, it really should work like
> this:
>
>   1. user realizes there's something wrong as the plans are bad
>   2. after analysis, the user realizes the reason is an imprecise
>      estimate due to dependence between columns
>   3. the user enables cross-column stats on the columns and checks
>      if it actually solved the issue
>   4. if the cost outweighs the gains, the user drops the stats
>
> Does that sound reasonable?

Yes.  The only caveat I'm trying to insert is that it's hard for me to
see how the proposed approach could ever be cheap enough to be a win.
If we need some more expensive kind of ANALYZE that scans the whole
table, and it's optional, sure, why not?  But that's a one-time hit.
You run it, it sucks, it's over, and now your queries work.  Odds are
good you may never need to touch it again.

Now, if we can convince ourselves that multi-column stats are likely
to require constant updating, then maybe there would be more to
recommend the stream-based approach, although even then it seems
dreadfully complex and expensive to me.  But I bet these things are
pretty static.  If the city and state tend to imply the zip code when
there are 10K rows in the table, they probably still will when there
are 100K rows in the table.  If users with org_id = 1 have a higher
karma score on average than users with org_id != 1, that'll probably
still be true when there are more users in both classes.  Whether the
database can understand that without rescanning the table depends on
the data representation, of course, but I guess I'm just saying I'd
think really, really hard before abandoning the idea of periodic
sampling.  You have to get an awful lot of benefit out of those
cross-column stats to make it worth paying a run-time cost for them.
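For concreteness, here is a minimal sketch of the kind of mergeable
ndistinct estimator the merge-at-commit scheme above assumes.  It uses
a k-minimum-values (KMV) synopsis rather than the adaptive-sampling
estimator from the actual patch, and the type, constant, and function
names (KMVEstimator, KMV_K, kmv_add, kmv_merge, kmv_estimate) are
illustrative assumptions, not anything in the proposed code:

/*
 * Sketch of a mergeable distinct estimator: a K-Minimum-Values (KMV)
 * synopsis.  Each backend keeps a private copy, updates it for the
 * tuples it inserts, and the copies can be merged into the shared
 * estimator at commit without rescanning the table.
 */
#include <stdint.h>
#include <string.h>

#define KMV_K 256				/* number of minimum hashes retained */

typedef struct KMVEstimator
{
	int			nvalues;		/* how many of the k slots are filled */
	uint64_t	minhash[KMV_K]; /* k smallest hashes, sorted ascending */
} KMVEstimator;

/* Insert one hashed value, keeping only the k smallest distinct hashes. */
static void
kmv_add(KMVEstimator *est, uint64_t h)
{
	int			i;

	/* ignore hashes no smaller than the current k-th minimum */
	if (est->nvalues == KMV_K && h >= est->minhash[KMV_K - 1])
		return;

	/* find the insert position; skip duplicates */
	for (i = 0; i < est->nvalues; i++)
	{
		if (est->minhash[i] == h)
			return;
		if (est->minhash[i] > h)
			break;
	}

	/* shift larger entries right, dropping the largest one if full */
	if (est->nvalues < KMV_K)
		est->nvalues++;
	memmove(&est->minhash[i + 1], &est->minhash[i],
			(est->nvalues - 1 - i) * sizeof(uint64_t));
	est->minhash[i] = h;
}

/* Merge a backend-local copy into another estimator. */
static void
kmv_merge(KMVEstimator *dst, const KMVEstimator *src)
{
	int			i;

	for (i = 0; i < src->nvalues; i++)
		kmv_add(dst, src->minhash[i]);
}

/* Estimate ndistinct as (k - 1) / fraction of hash space covered. */
static double
kmv_estimate(const KMVEstimator *est)
{
	if (est->nvalues < KMV_K)
		return (double) est->nvalues;	/* saw fewer than k distinct values */
	return (double) (KMV_K - 1) /
		((double) est->minhash[KMV_K - 1] / (double) UINT64_MAX);
}

With a synopsis like this, each backend could start from an empty
local copy for the rows it inserts and kmv_merge() it into the shared
estimator at commit.  DELETEs remain the hard part: a synopsis of this
kind cannot "unsee" a value without rebuilding from the table, which
is the problem noted above.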
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company