On Mon, Jul 29, 2019 at 3:39 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I'm not saying you can't handle it. But that necessitates "write
> amplification", in the sense that you must now create new index tuples
> even for indexes where the indexed columns were not logically altered.
> Isn't zheap supposed to fix that problem, at least at in version 2 or
> version 3? I also think that stable heap TIDs make index-only scans a
> lot easier and more effective.

I think there's a cost-benefit analysis here. You're completely
correct that inserting new index tuples causes write amplification
and, yeah, that's bad. On the other hand, row forwarding has its own
costs. If a row ends up persistently moved to someplace else, then
every subsequent access to that row has an extra level of
indirection. If it ends up split between two places, every read of
that row incurs two reads. The "someplace else" where moved rows or
ends of split rows are stored has to be skipped by sequential scans,
which is complex and possibly inefficient if it breaks up a
sequential I/O pattern. Those things are bad, too.

It's a little difficult to compare the kinds of badness. My thought
is that in the short run, the redirect strategy probably wins, because
there could be and likely are a bunch of indexes and it's cheaper to
just insert one redirect. But in the long term, the redirect thing
seems like a loser, because you have to keep following it. That
(perhaps naive) analysis is why zheap doesn't try to maintain TID
stability. Instead it wants to do in-place updates (no new TID) as
often as possible, but the fallback strategy is simply to do a
non-in-place update (new TID) rather than a redirect.

> I think that indexes (or at least B-Tree indexes) will ideally almost
> always have tuples that are the latest versions with zheap. The
> exception is tuples whose ghost bit is set, whose visibility varies
> based on the MVCC snapshot in use. But the instant that the
> deleting/updating xact commits it becomes legal to recycle the old
> heap TID. We don't need to go back to the index to permanently zap the
> tuple whose ghost bit we already set, because there is an undo pointer
> in the same leaf page, so nobody is in danger of getting confused and
> following the now-recycled heap TID.

I haven't run across the "ghost bit" terminology before. Is there a
good place to read about the technique you're assuming here?

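To put rough numbers on the redirect-versus-new-TID comparison made
earlier in this message, here is a back-of-the-envelope sketch. It is
not anything zheap implements. The function names and the unit costs
(one unit per index tuple inserted, one unit per redirect written, one
unit per extra hop on a later read) are assumptions chosen purely for
illustration.

```python
def new_tid_cost(n_indexes: int) -> int:
    """Non-in-place update: one new index tuple per index, paid once.

    Assumption: every index insertion costs one unit.
    """
    return n_indexes


def redirect_cost(n_later_reads: int) -> int:
    """Redirect: one redirect written now, plus one extra hop per later read.

    Assumption: the redirect write and each extra hop cost one unit each.
    """
    return 1 + n_later_reads


def break_even_reads(n_indexes: int) -> int:
    """Number of later reads at which the redirect stops being cheaper."""
    return max(n_indexes - 1, 0)


if __name__ == "__main__":
    indexes = 5
    for reads in (0, 3, 4, 20):
        print(reads, redirect_cost(reads), new_tid_cost(indexes))
```

Under these made-up costs, a table with five indexes favors the
redirect only for the first few reads of the moved row. After that the
one-time index insertions are cheaper, which is the short-run versus
long-run distinction drawn above. Real costs (WAL volume, page splits,
cache misses) are not uniform, so treat this only as a way to see the
shape of the trade-off.
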
A major question is how you handle inserted rows that are new now and
thus not yet visible to everyone, but which will later become
all-visible. One idea is: if the undo pointer is new enough that a
write transaction which modified the page could still be in-flight,
check the undo log to ascertain visibility of index tuples. If not,
then any potentially-deleted index tuples are in fact deleted, and any
others are all-visible. With this design, you don't set the ghost bit
on new tuples, but are still able to stop following the undo pointers
for them after a while.

To put that another way, there seems to be pretty clearly a need for a
bit, but what does the bit mean? It could mean "please check the undo
log," in which case it'd have to be set on insert, eventually cleared,
and then reset on delete, but I think that's likely to suck. I think
therefore that the bit should mean
is-deleted-but-not-necessarily-all-visible-yet, which avoids that
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
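
The visibility rule described above for index tuples can be written
out roughly as follows. This is only a sketch of the idea as stated in
this message, not code from zheap or PostgreSQL. The names `LeafPage`,
`IndexTuple`, `classify`, and `oldest_inflight_undo`, and the use of
plain integers for undo pointers, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    ALL_VISIBLE = "all-visible"
    DELETED = "deleted"
    CHECK_UNDO = "check-undo-log"


@dataclass
class LeafPage:
    # Hypothetical: undo-log position of the latest change to this page.
    undo_ptr: int


@dataclass
class IndexTuple:
    # The bit means "is-deleted-but-not-necessarily-all-visible-yet".
    # It is never set on insert, only on delete.
    deleted_bit: bool


def classify(page: LeafPage, tup: IndexTuple,
             oldest_inflight_undo: int) -> Visibility:
    """Decide how a reader should treat one index tuple.

    oldest_inflight_undo is a hypothetical horizon: an undo pointer at
    or beyond it could belong to a transaction that is still in flight.
    """
    if page.undo_ptr >= oldest_inflight_undo:
        # A writer that touched this page may still be running, so the
        # bit alone is not enough; consult the undo log.
        return Visibility.CHECK_UNDO
    # Every writer that touched the page has finished: a tuple marked
    # potentially-deleted really is deleted, and the rest are all-visible.
    return Visibility.DELETED if tup.deleted_bit else Visibility.ALL_VISIBLE


if __name__ == "__main__":
    print(classify(LeafPage(undo_ptr=100), IndexTuple(False), 50))
    print(classify(LeafPage(undo_ptr=10), IndexTuple(True), 50))
    print(classify(LeafPage(undo_ptr=10), IndexTuple(False), 50))
```

Note that a freshly inserted tuple never carries the bit here. It is
covered by the page's undo pointer while that pointer is recent, and
becomes all-visible without any later write to the index once the
pointer falls behind the horizon.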