On Wed, Jun 29, 2011 at 3:16 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> I think it's a fair point to ask how often thrashing cases will truly
> come up where you don't have some other significant cost like i/o.
> Even when you do thrash, you are just falling back on stock postgres
> behaviors minus the maintenance costs.
Agree.

> With the algorithm as coded, to fully flush the cache you'd have to
> find a series of *unhinted* tuples distributed across a minimum of
> four 64k wide page ranges, with the number of tuples being over the
> 5% noise floor.  Furthermore, to eject a cache page you have to have
> more hits than a page already in the cache got -- hits are tracked on
> the existing pages and the candidate pages are effectively compared
> with the existing pages based on hits, with the best four picked.
> Also, is it reasonable to expect the cache to help in these types of
> cases anyways? *scratches head*

Well, surely you must need to reset the counters for the pages you've
chosen to include in the cache at some point.  If you don't, then
there's a strong likelihood that after some period of time the pages
you have in there will become so firmly nailed into the cache that
they can never be evicted.

To be clear, I don't really think it matters how sensitive the cache
is to a *complete* flush.  The question I want to ask is: how much
does it take to knock ONE page out of the cache?  And what are the
chances of that happening too frequently?  It seems to me that if a
run of 100 tuples with the same previously-unseen XID is enough to
knock over the applecart, then that's not a very high bar -- you could
easily hit that limit on a single page.  And if that isn't enough,
then I don't understand the algorithm.

> If your concern is the cost of replacement, then you can
> micro-optimize (a hand rolled qsort alone should squeeze out quite a
> bit of fat) or experiment with a new algorithm as you suggested.
> However I should note that at least for pgbench fsync=off oltp tests
> (the only way I could think of to exercise cache replacement), it
> (replacement costs) did not show up on the radar.

It might be useful, just for debugging purposes, to add an elog(LOG)
in there that triggers whenever a cache flush occurs.  And then play
with different workloads and try to make it happen as often as you
can.
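To make the "knock ONE page out" question concrete, here is a tiny
sketch of the replacement rule *as I understand it from the
description upthread* -- this is guesswork for illustration, not the
patch's actual code, and every name in it is made up: a candidate page
range only counts if its hits clear a 5% noise floor, and it evicts a
cached page only if it out-hit that page.

```c
#include <stdint.h>

/* Hypothetical restatement of the described eviction rule; not code
 * from the patch.  A candidate page range is ignored unless its hits
 * clear the 5% noise floor; otherwise it wins a slot if it beat the
 * coldest page currently in the cache. */
#define NOISE_FLOOR_PCT 5

static int
candidate_evicts(uint32_t candidate_hits, uint32_t total_probes,
                 uint32_t coldest_cached_hits)
{
    /* below the noise floor: the candidate is never considered */
    if (candidate_hits * 100 < total_probes * NOISE_FLOOR_PCT)
        return 0;
    /* above it: the candidate evicts iff it out-hit the coldest page */
    return candidate_hits > coldest_cached_hits;
}
```

Under that reading, a single burst of 100 unhinted tuples carrying one
new XID, amid 1000 total probes, clears the floor and displaces any
cached page that collected fewer than 100 hits -- which is the
single-page scenario I'm worried about.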
One idea I had was to hack the XID generator so that it only returns
every (say) 200th XID, instead of consecutive ones.  That might make
it easier to identify the sorts of workloads that are prone to causing
problems, and then we could further analyze those workloads and see
whether they are a problem in real life, and/or what can be done to
minimize the impact.

> OTOH, if your main concern is about losing data out of the cache via
> page swap, increasing the number of cache pages (or using smaller
> ones) might work, but this incrementally makes testing the cache
> more expensive.

Yeah, I am inclined to think that going very far in that direction is
going to be a big pile of fail.  TBH, I'm somewhat surprised that you
managed to get what you already have to work without a performance
regression.  That code path is extremely hot, and injecting more
complexity seems like it ought to be a loser.  For purposes of this
review, I'm taking it as given that you've tested this carefully, but
I'll admit to lingering skepticism.

> I did briefly experiment with trying to bootstrap incoming cache
> pages with data gathered out of the clog -- but it didn't help much
> and was a lot more trouble than it was worth.

I believe it.

> One possible enhancement would be to try and bias pages with more
> data in them so that it's more difficult to get them out -- or even
> maintain separate pools with ejection priority.  I'm not in a hurry,
> and suggestions are welcome.

I think the basic problem is that the cache pages are so large.  It's
hard to make them smaller because that increases the cost of accessing
the cache, as you already noted, but at the same time, making an
eviction decision on a cache that holds 64K entries based on 100 data
points seems like waving around a pretty large-caliber weapon in a
fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
it's hard to avoid the niggling fear that someone might accidentally
get shot.
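Circling back to the XID-generator hack for a moment: the reason it
should shake problem workloads out faster is simple arithmetic -- if
the generator hands out only every Nth XID, each 64K-wide cache page
covers N times fewer transactions, so a workload marches across page
ranges that much sooner.  A purely illustrative helper (the function
name is invented, not from any patch):

```c
#include <stdint.h>

/* Illustrative arithmetic only: how many transactions land in one
 * 64K-XID-wide cache page when the generator skips ahead by
 * 'xid_stride' per transaction.  With a stride of 200, a page covers
 * only ~327 transactions instead of 65536, so replacement pressure
 * shows up roughly 200x sooner. */
static uint32_t
txns_per_cache_page(uint32_t page_width_xids, uint32_t xid_stride)
{
    return page_width_xids / xid_stride;
}
```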
Just for the sake of argument, let's suppose we made an array with 64K
elements, each one representing one 64K chunk of the XID space.  Each
element is a 4-byte unsigned integer, so the array takes up 256kB of
memory... probably not good, but I'm just thinking out loud here.
Every time we're asked about an XID, we bump the corresponding
counter.  Periodically (say, every 64K probes) we zip through all the
counters and consider performing a cache replacement.  We look at the
not-already-cached page that has the highest counter value and compare
it to the counter value for the cached pages.  If it is higher and the
difference exceeds some safety margin (to provide some hysteresis),
then we do the replacement.  I think something like this would allow
you to have a lot more information available to make a direct
apples-to-apples comparison at cache replacement time, as compared
with what you have now.

Now, the big problem here is that a 256kB array is probably too big,
both because we don't want to blow out the lower-level CPU caches and
also because every time we consider performing a cache replacement
we're going to want to zero the whole thing, and that has to be fast.
But we probably don't really need the array to cover the entire XID
space.  vacuum_freeze_min_age defaults to 50 million transactions, and
vacuum_freeze_table_age defaults to 150 million transactions.  If we
only concern ourselves with the last, say, 256 million transactions,
which ought to be way more than we will need for almost any practical
purpose, then that's just 1/16th of the total XID space.  If we only
worry about the last 64 million transactions, then that's just 1/64th
of the total XID space.  Possibly we could take an even smaller slice
than that.  At any rate, now you're talking about a much smaller
array, although sadly with a bit more bookkeeping to keep track of
which portion of the XID space we're currently tracking.
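A minimal sketch of that counter-array scheme, just to pin down the
moving parts -- all names, sizes, and the safety margin here are
placeholders, not a proposal for actual values (and it uses a 1024-slot
slice rather than the full 64K array, per the slicing discussion
above):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the counter-array idea: each slot counts probes landing
 * in one 64K-XID chunk; periodically we consider swapping the hottest
 * uncached chunk in for the coldest cached one, then zero all the
 * counters so no page can get permanently nailed into the cache. */
#define XIDS_PER_CHUNK  65536u
#define NCHUNKS         1024u   /* tracked slice of the XID space */
#define NCACHED         4u      /* chunks materialized in the cache */
#define SAFETY_MARGIN   100u    /* hysteresis guard before swapping */

static uint32_t probe_count[NCHUNKS];
static uint32_t cached_chunk[NCACHED] = {0, 1, 2, 3};

static void
note_probe(uint32_t xid)
{
    probe_count[(xid / XIDS_PER_CHUNK) % NCHUNKS]++;
}

static int
is_cached(uint32_t chunk)
{
    for (uint32_t c = 0; c < NCACHED; c++)
        if (cached_chunk[c] == chunk)
            return 1;
    return 0;
}

/* Perform at most one replacement, then reset every counter.
 * Returns the chunk brought in, or UINT32_MAX if nothing changed. */
static uint32_t
consider_replacement(void)
{
    uint32_t hottest = UINT32_MAX;
    uint32_t coldest = 0;
    uint32_t result = UINT32_MAX;

    /* hottest not-already-cached chunk */
    for (uint32_t s = 0; s < NCHUNKS; s++)
        if (!is_cached(s) &&
            (hottest == UINT32_MAX ||
             probe_count[s] > probe_count[hottest]))
            hottest = s;

    /* coldest chunk currently in the cache */
    for (uint32_t c = 1; c < NCACHED; c++)
        if (probe_count[cached_chunk[c]] <
            probe_count[cached_chunk[coldest]])
            coldest = c;

    /* replace only if the winner clears the safety margin */
    if (hottest != UINT32_MAX &&
        probe_count[hottest] >
        probe_count[cached_chunk[coldest]] + SAFETY_MARGIN)
    {
        cached_chunk[coldest] = hottest;
        result = hottest;
    }
    memset(probe_count, 0, sizeof(probe_count));
    return result;
}
```

The unconditional memset at the end is the "reset the counters" step I
was asking about above: every replacement decision starts from a clean
slate, so a formerly hot page has to keep earning its slot.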
And maybe you could also use 2-byte counters rather than 4-byte
counters, if you're careful to set it up so that they can't overflow.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers