On Sat, Jun 22, 2019 at 2:51 AM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > The idea behind having the loop inside the undo machinery was that
> > while traversing the blkprev chain, we can read all the undo records
> > on the same undo page under one buffer lock.
>
> That's not a bad goal, although invoking a user-supplied callback
> while holding a buffer lock is a little scary. If we stick with that,
> it had better be clearly documented. Perhaps worth noting:
> ReadBuffer() is noticeably more expensive in terms of CPU consumption
> than LockBuffer(). So it may work out that we keep a pin to avoid
> redoing that and worry less about retaking the buffer locks. But I'm
> not sure: avoiding buffer locks is clearly good, too.
>
> I have a couple of observations after poking at this some more. One
> is that it's not necessarily adequate to have an interface that
> iterates backward through the undo chain until the callback returns
> true. Here's an example: suppose you want to walk backward through
> the undo chain until you find the first undo record that corresponds
> to a change you can see, and then return the undo record immediately
> prior to that. zheap doesn't really need this, because it (mostly?)
> stores the XID of the next record we're going to look up in the
> current record, and the XID of the first record we're going to look up
> in the chain -- so it can tell from the record it's found so far
> whether it should bother looking up the next record. That, however,
> would not necessarily be true for some other AM.
>
> Suppose you just store an undo pointer in the tuple header, as Heikki
> proposed to do for zedstore. Suppose further that each record has the
> undo pointer for the previous record that modified that TID but not
> necessarily the XID. Then imagine a TID where we do an insert and a
> bunch of in-place updates. Then, a scan comes along with an old
> snapshot. It seems to me that what you need to do is walk backward
> through the chain of undo records until you see one that has an XID
> that is visible to your snapshot, and then the version of the tuple
> that you want is in the payload of the next-newer undo record. So what
> you want to do is something like this:
>
> look up the undo pointer in the tuple. call that the current undo record.
> loop:
> - look up the undo pointer in the current undo record. call that the
> previous undo record.
> - if the XID from the previous undo record is visible, then stop; use
> the tuple version from the current undo record.
> - release the current undo record and let the new current undo record
> be the previous undo record.
>
> I'm not sure if this is actually a reasonable design from a
> performance point of view, but it doesn't sound totally crazy, and
> it's just to illustrate the point that there might be cases that are
> too complicated for a loop-until-true model. In this kind of loop, at
> any given time you are holding onto two undo records, working your way
> back through the undo log, and you just can't make this work with the
> UndoFetchRecord callback interface. Possibly you could have a context
> object that could hold onto one or a few buffer pins:
>
> BeginUndoFetch(&cxt);
> uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times
> FinishUndoFetch(&cxt);
>
> ...but I'm not sure if that's exactly what we want either. Still, if
> there is a significant savings from avoiding repinning and relocking
> the buffer, we want to make it easy for people to get that advantage
> as often as possible.
>
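To make sure I am reading the proposed interface correctly, below is a
rough sketch of how a caller could drive the two-record walk you
describe using such a context object. BeginUndoFetch/UndoFetchRecord/
FinishUndoFetch are the calls from your mail; the UndoFetchContext
type, the uur_* field names, UndoRecordRelease, and
XidIsVisibleToSnapshot are only placeholders for illustration, not the
actual patch API.

/*
 * Sketch only: return the newest tuple version whose creating XID is
 * visible to the given snapshot, by walking the per-TID undo chain
 * backwards.  The context can keep the last undo buffer pinned, so
 * fetching further records from the same undo page stays cheap.
 */
UnpackedUndoRecord *
GetVisibleUndoVersion(UndoRecPtr urp, Snapshot snapshot)
{
    UndoFetchContext cxt;           /* placeholder name */
    UnpackedUndoRecord *cur;
    UnpackedUndoRecord *prev;

    BeginUndoFetch(&cxt);

    /* Start from the undo pointer stored in the tuple header. */
    cur = UndoFetchRecord(&cxt, urp);

    for (;;)
    {
        /* Fetch the undo record for the previous version of this TID. */
        prev = UndoFetchRecord(&cxt, cur->uur_blkprev);

        /*
         * If the previous change is visible to our snapshot (or the
         * chain has ended), the version we want is the one stored in
         * the current record.
         */
        if (prev == NULL || XidIsVisibleToSnapshot(prev->uur_xid, snapshot))
            break;

        /* Otherwise step one record further back in the chain. */
        UndoRecordRelease(cur);
        cur = prev;
    }

    if (prev != NULL)
        UndoRecordRelease(prev);

    FinishUndoFetch(&cxt);          /* drops any remaining pin */

    return cur;
}

At any point in the loop the caller holds at most two unpacked records,
while the context, rather than the caller, worries about which undo
buffer is currently pinned.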
IIUC, you are proposing to retain the pin and then lock/unlock the
buffer each time in this API. I think there is no harm in trying out
something on these lines to see if there is any impact on some of the
common scenarios.

> Another point that has occurred to me is that it is probably
> impossible to avoid a fairly large number of duplicate undo fetches.
> For instance, suppose somebody runs an UPDATE on a tuple that has been
> recently updated. The tuple_update method just gets a TID + snapshot,
> so the AM basically has to go look up the tuple all over again,
> including checking whether the latest version of the tuple is the one
> visible to our snapshot. So that means repinning and relocking the
> same buffers and decoding the same undo record all over again. I'm
> not exactly sure what to do about this, but it seems like a potential
> performance problem. I wonder if it's feasible to cache undo lookups
> so that in common cases we can just reuse the result of a previous
> lookup instead of doing a new one, and I wonder whether it's possible
> to make that fast enough that it actually helps...
>

I think it will be helpful if we can have such a cache, but OTOH, we
can try out such optimizations after the first version as well, by
analyzing their benefit. For zheap, in many cases, the version in the
heap itself is either all-visible or visible as per the current
snapshot, and that can be detected by looking at the transaction slot;
however, it might be tricky for zedstore.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com