On Sat, Jun 22, 2019 at 2:51 AM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > The idea behind having the loop inside the undo machinery was that
> > while traversing the blkprev chain, we can read all the undo records
> > on the same undo page under one buffer lock.
>
> That's not a bad goal, although invoking a user-supplied callback
> while holding a buffer lock is a little scary. If we stick with that,
> it had better be clearly documented. Perhaps worth noting:
> ReadBuffer() is noticeably more expensive in terms of CPU consumption
> than LockBuffer(). So it may work out that we keep a pin to avoid
> redoing that and worry less about retaking the buffer locks. But I'm
> not sure: avoiding buffer locks is clearly good, too.
>
> I have a couple of observations after poking at this some more. One
> is that it's not necessarily adequate to have an interface that
> iterates backward through the undo chain until the callback returns
> true. Here's an example: suppose you want to walk backward through
> the undo chain until you find the first undo record that corresponds
> to a change you can see, and then return the undo record immediately
> prior to that. zheap doesn't really need this, because it (mostly?)
> stores the XID of the next record we're going to look up in the
> current record, and the XID of the first record we're going to look up
> in the chain -- so it can tell from the record it's found so far
> whether it should bother looking up the next record. That, however,
> would not necessarily be true for some other AM.
>
> Suppose you just store an undo pointer in the tuple header, as Heikki
> proposed to do for zedstore. Suppose further that each record has the
> undo pointer for the previous record that modified that TID but not
> necessarily the XID. Then imagine a TID where we do an insert and a
> bunch of in-place updates. Then, a scan comes along with an old
> snapshot. It seems to me that what you need to do is walk backward
> through the chain of undo records until you see one that has an XID
> that is visible to your snapshot, and then the version of the tuple
> that you want is in the payload of the next-newer undo record. So what
> you want to do is something like this:
>
> look up the undo pointer in the tuple. call that the current undo record.
> loop:
> - look up the undo pointer in the current undo record. call that the
> previous undo record.
> - if the XID from the previous undo record is visible, then stop; use
> the tuple version from the current undo record.
> - release the current undo record and let the new current undo record
> be the previous undo record.
>
> I'm not sure if this is actually a reasonable design from a
> performance point of view, but it doesn't sound totally crazy, and
> it's just to illustrate the point that there might be cases that are
> too complicated for a loop-until-true model. In this kind of loop, at
> any given time you are holding onto two undo records, working your way
> back through the undo log, and you just can't make this work with the
> UndoFetchRecord callback interface. Possibly you could have a context
> object that could hold onto one or a few buffer pins:
>
> BeginUndoFetch(&cxt);
> uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times
> FinishUndoFetch(&cxt);
>
> ...but I'm not sure if that's exactly what we want either. Still, if
> there is a significant savings from avoiding repinning and relocking
> the buffer, we want to make it easy for people to get that advantage
> as often as possible.
>
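To make sure I am reading the proposed interface correctly, below is a
rough sketch of how a caller could drive the two-record walk you
describe using such a context object. BeginUndoFetch/UndoFetchRecord/
FinishUndoFetch are the calls from your mail; the UndoFetchContext
type, the uur_* field names, UndoRecordRelease, and
XidIsVisibleToSnapshot are only placeholders for illustration, not the
actual patch API.

/*
 * Sketch only: return the newest tuple version whose creating XID is
 * visible to the given snapshot, by walking the per-TID undo chain
 * backwards.  The context can keep the last undo buffer pinned, so
 * fetching further records from the same undo page stays cheap.
 */
UnpackedUndoRecord *
GetVisibleUndoVersion(UndoRecPtr urp, Snapshot snapshot)
{
    UndoFetchContext cxt;           /* placeholder name */
    UnpackedUndoRecord *cur;
    UnpackedUndoRecord *prev;

    BeginUndoFetch(&cxt);

    /* Start from the undo pointer stored in the tuple header. */
    cur = UndoFetchRecord(&cxt, urp);

    for (;;)
    {
        /* Fetch the undo record for the previous version of this TID. */
        prev = UndoFetchRecord(&cxt, cur->uur_blkprev);

        /*
         * If the previous change is visible to our snapshot (or the
         * chain has ended), the version we want is the one stored in
         * the current record.
         */
        if (prev == NULL || XidIsVisibleToSnapshot(prev->uur_xid, snapshot))
            break;

        /* Otherwise step one record further back in the chain. */
        UndoRecordRelease(cur);
        cur = prev;
    }

    if (prev != NULL)
        UndoRecordRelease(prev);

    FinishUndoFetch(&cxt);          /* drops any remaining pin */

    return cur;
}

At any point in the loop the caller holds at most two unpacked records,
while the context, rather than the caller, worries about which undo
buffer is currently pinned.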
IIUC, you are proposing to retain the pin and then lock/unlock the
buffer each time in this API. I think there is no harm in trying out
something on these lines to see if there is any impact on some of the
common scenarios.

> Another point that has occurred to me is that it is probably
> impossible to avoid a fairly large number of duplicate undo fetches.
> For instance, suppose somebody runs an UPDATE on a tuple that has been
> recently updated. The tuple_update method just gets a TID + snapshot,
> so the AM basically has to go look up the tuple all over again,
> including checking whether the latest version of the tuple is the one
> visible to our snapshot. So that means repinning and relocking the
> same buffers and decoding the same undo record all over again. I'm
> not exactly sure what to do about this, but it seems like a potential
> performance problem. I wonder if it's feasible to cache undo lookups
> so that in common cases we can just reuse the result of a previous
> lookup instead of doing a new one, and I wonder whether it's possible
> to make that fast enough that it actually helps...
>

I think it will be helpful if we can have such a cache, but OTOH, we
can try out such optimizations after the first version as well, by
analyzing their benefit. For zheap, in many cases, the version in the
heap itself is either all-visible or visible as per the current
snapshot, and that can be detected by looking at the transaction slot;
however, it might be tricky for zedstore.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com