On Tue, Nov 22, 2016 at 11:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Peter Geoghegan <p...@heroku.com> writes:
>> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Oracle spends a lot of time on this, and it's really cache-inefficient
>>> because the data is spread all over.  This was what this guy felt in
>>> circa 2001; I'd have to think that the cache unfriendliness problem is
>>> much worse for modern hardware.
>
>> I imagine that temporal locality helps a lot. Most snapshots will be
>> interested in the same row version.
>
> Yeah, but the problem is that all those transactions have to scavenge
> through a big chunk of UNDO log to find out what they need to know.
In my design, there's one UNDO log per transaction, so if a page has been recently modified only by a single transaction whose effects are visible to your snapshot, you can very quickly decide that you don't need to look at UNDO at all. If it's been recently modified by multiple transactions, all of which are visible to your snapshot, you can consult the TPD and then decide that none of those transactions are interesting and therefore their UNDO doesn't need to be examined. When there are transactions whose effects are not visible to your snapshot, the cost is proportional to the number of times those transactions have modified the page. (A rough sketch of this check appears below.)

Therefore, the "bad" case for this design is when the same page gets updated a large number of times by one or more concurrent transactions. If the transaction you can't see has made 437 separate modifications to the page, you're going to need to look at all 437 UNDO records to figure out what you can see. That is likely to suck.

I think that one of the crucial questions for this design is how often that will happen. For people with mostly static data, it should be rare. In something like a pgbench workload, it boils down to how often multiple transactions hit the same page before the previous transactions that hit that page become all-visible. Of course, if you have too many backends relative to the scale factor, pgbench is going to become lock-bound anyway and it probably won't matter much, but it's about two orders of magnitude easier to hit the same page than to hit the same tuple, so it could be that the point where you start incurring significant costs for visibility checking and page reconstruction happens well before the tuple lock contention becomes noticeable. It might be worth doing some mathematical modeling to try to estimate how serious that effect is likely to be.

There are also some things that you can do to mitigate this, although they add even more complexity. For example:

- If the same transaction touches the same page repeatedly, you could make a rule that every 5th or every 10th or every 20th modification to the same page within the same transaction emits a larger summary record into the UNDO that is a consolidated record of all changes made to the page by that transaction thus far. That penalizes writers, but helps readers; when a reader reaches a summary record, it need look no further.

- Peter alluded to the possibility - or so I thought - of caching reconstructed page versions. That would make a lot of sense for sequential scans or cases where the same page is being accessed repeatedly.

- If you're doing an index scan and you only care about one tuple, the page could contain one bit per tuple which is set if the tuple has been modified by any transaction the page views as recent. If the bit is clear for the one tuple you care about, you can ignore the UNDO even if it is for a transaction not visible to your snapshot, because you know it didn't touch that tuple.

- If you create a TPD entry for the page because there are multiple recent transactions that have modified it, you can put additional information into the TPD entry to speed up access. For example, you could store an array indexed by itemid which indicates which of the recent transactions have touched any given itemid. This is mostly useful for index scans, which can then ignore all the transactions that don't pertain to the one TID they care about.

Which of those ideas are best? Are they enough, even if we do them all? Will there be other down sides? I don't know.
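To make the "can I skip UNDO?" decision a bit more concrete, here's a rough, compilable sketch of the check described at the top of this mail. To be clear, all of the type and field names here (PageOpaque, TPDEntry, recent_xid, has_tpd) and the toy xmax-only visibility rule are invented for illustration; the real data structures would obviously look different and the real visibility test is much more involved. The only point is that the common cases - page not recently modified, or modified only by transactions the snapshot can see - fall out without ever touching UNDO.

/*
 * Illustrative sketch only; all names are hypothetical, not from any patch.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

typedef struct Snapshot
{
    TransactionId xmax;         /* toy rule: XIDs >= xmax are invisible */
} Snapshot;

/* invented: what the page itself remembers about recent modifiers */
typedef struct PageOpaque
{
    int           nrecent;      /* number of recent modifying transactions */
    TransactionId recent_xid;   /* valid only when nrecent == 1 */
    bool          has_tpd;      /* multiple modifiers -> consult TPD entry */
} PageOpaque;

/* invented: TPD entry listing the recent modifiers of the page */
typedef struct TPDEntry
{
    int           nxids;
    TransactionId xids[16];
} TPDEntry;

static bool
xid_visible(Snapshot *snap, TransactionId xid)
{
    return xid < snap->xmax;    /* stand-in for the real visibility test */
}

/*
 * Return true if the page contents can be used directly, false if we must
 * walk the UNDO of at least one recent transaction to reconstruct what the
 * snapshot should see.
 */
static bool
page_visible_without_undo(PageOpaque *opaque, TPDEntry *tpd, Snapshot *snap)
{
    if (opaque->nrecent == 0)
        return true;            /* no recent modifiers: no UNDO needed */

    if (!opaque->has_tpd)
        return xid_visible(snap, opaque->recent_xid);

    /* multiple recent modifiers: consult the TPD entry */
    for (int i = 0; i < tpd->nxids; i++)
    {
        if (!xid_visible(snap, tpd->xids[i]))
            return false;       /* must examine that transaction's UNDO */
    }
    return true;
}

int
main(void)
{
    Snapshot    snap = { .xmax = 100 };
    PageOpaque  one = { .nrecent = 1, .recent_xid = 42, .has_tpd = false };
    PageOpaque  many = { .nrecent = 2, .has_tpd = true };
    TPDEntry    tpd = { .nxids = 2, .xids = { 42, 250 } };

    printf("single visible modifier, skip UNDO:  %d\n",
           page_visible_without_undo(&one, NULL, &snap));
    printf("one invisible modifier, need UNDO:   %d\n",
           !page_visible_without_undo(&many, &tpd, &snap));
    return 0;
}

The expensive path is only the final one, and its cost is what the mitigation ideas above are trying to bound.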
But I think it's clear that bloat is an ongoing and serious concern for our users (cf. snapshot too old), and that vacuuming is too (cf. freeze map), and I believe we are reaching a point where we can't make much further improvement in those areas without some kind of major architectural change. WARM might qualify; this idea certainly does. I also believe that we NEED to improve further. At least for EnterpriseDB, the problems I'm trying to address through this design are big problems. If this design isn't good, let's come up with something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company