On Fri, Jul 19, 2019 at 7:28 PM Peter Geoghegan <p...@bowt.ie> wrote:
> If I'm not mistaken, you're tacitly assuming that you'll always be
> using zheap, or something sufficiently similar to zheap. It'll
> probably never be possible to UNDO changes to something like a GIN
> index on a zheap table, because you can never do that with sensible
> concurrency/deadlock behavior.
I mean, essentially any well-designed framework intended for any sort of task whatsoever is going to have a design center where one can foresee that it will work well, and then, as a result of working well for the thing for which it was designed, it will also work well for other things that are sufficiently similar. So I think you're correct, but I also don't think that's really saying very much. The trick is to figure out whether and how the ideas you have could be generalized with reasonable effort to handle other cases, and that's easier with some projects than others. When it comes to UNDO, I think it's actually really hard. The system has some assumptions built into it which are probably required for good performance and reasonable complexity, and it probably has other assumptions built into it which are unnecessary and could be eliminated if we only realized that we were making them in the first place. The more involvement we get from people who aren't coming at this from the point of view of zheap, the more likely it is that we'll find those assumptions and wipe them out before they get set in concrete. Unfortunately, we haven't had many takers so far -- thanks for chiming in.

I don't really understand your comments about GIN. My simplistic understanding of GIN is that it's not very different from btree in this regard. Suppose we insert a row, and then the insert aborts; suppose also that the index wants to use UNDO. In the case of a btree index, we insert an index entry for the new row; upon abort, we should undo the index insertion by removing that index tuple, or at least marking it dead. Unless a page split has happened, triggered either by the insertion itself or by subsequent activity, this puts the index in a state that is almost perfectly equivalent to where we were before the now-aborted transaction did any work. If a page split has occurred, trying to undo the index insertion runs into two problems. One, we probably can't undo the page split, so the index will be logically but not physically equivalent after we get rid of the new tuple. Two, if the page split happened after the insertion of the new tuple rather than at the same time, the index tuple may not be on the page where we left it. Possibly we can walk right (or left, or sideways, or diagonally at a 35-degree angle; my index-fu is not great here) and be sure of finding it, assuming the index is not corrupt.

Now, my mental model of a GIN index is that you extract N>=0 index keys from each value and do basically the same thing as you would for a btree index for each one of them. Therefore it seems to me, possibly stupidly, that you're going to have basically the same problems, except each problem can now happen up to N times instead of up to 1 time. I assume here that in either case - GIN or btree - you would tentatively record where you left the tuple that now needs to be zapped, so that you can jump to that place directly to try to zap it. Possibly those assumptions are bad, and maybe that's where you're seeing a concurrency/deadlock problem; if so, a more detailed explanation would be very helpful. Based on my more limited knowledge of indexing, I'm not really seeing a concurrency/deadlock issue here, but I do see that there's going to be a horrid efficiency problem if page splits are common.
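To make the re-find problem concrete, here's a minimal, self-contained toy model - emphatically not PostgreSQL code; Page, locate, page_insert, and undo_insert are all made-up names - of a leaf chain where an insert remembers the page it landed on, and an aborting transaction's undo must walk right past any intervening splits to zap the tuple:

    /*
     * Toy model of the re-find problem (not PostgreSQL code).  An insert
     * remembers the page its tuple landed on, but later page splits can
     * move that tuple to a right sibling, so undo may have to walk right.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_CAPACITY 4

    typedef struct Page {
        int keys[PAGE_CAPACITY];
        int nkeys;
        struct Page *right;    /* right-sibling link, as in a btree leaf chain */
    } Page;

    /* Find the leaf that should hold key, starting from the leftmost page. */
    static Page *locate(Page *head, int key) {
        Page *p = head;
        while (p->right != NULL && key >= p->right->keys[0])
            p = p->right;
        return p;
    }

    /* Insert in sorted order, splitting a full page; returns the page the
     * tuple landed on, which is what undo would tentatively remember. */
    static Page *page_insert(Page *p, int key) {
        if (p->nkeys == PAGE_CAPACITY) {
            Page *r = calloc(1, sizeof(Page));
            int half = PAGE_CAPACITY / 2;
            memcpy(r->keys, p->keys + half, (PAGE_CAPACITY - half) * sizeof(int));
            r->nkeys = PAGE_CAPACITY - half;
            p->nkeys = half;
            r->right = p->right;
            p->right = r;
            if (key >= r->keys[0])
                p = r;
        }
        int i = p->nkeys++;
        while (i > 0 && p->keys[i - 1] > key) {
            p->keys[i] = p->keys[i - 1];
            i--;
        }
        p->keys[i] = key;
        return p;
    }

    /* Undo one insert: start at the remembered page and, if splits have
     * moved the key, walk right along the sibling chain to re-find it. */
    static void undo_insert(Page *remembered, int key) {
        int visited = 0;
        for (Page *p = remembered; p != NULL; p = p->right) {
            visited++;
            for (int i = 0; i < p->nkeys; i++) {
                if (p->keys[i] == key) {
                    memmove(&p->keys[i], &p->keys[i + 1],
                            (p->nkeys - i - 1) * sizeof(int));
                    p->nkeys--;
                    printf("undo of key %2d visited %d page(s)\n", key, visited);
                    return;
                }
            }
        }
    }

    int main(void) {
        Page *head = calloc(1, sizeof(Page));
        Page *remembered[12];
        for (int k = 0; k < 12; k++)       /* the "transaction" inserts 12 keys */
            remembered[k] = page_insert(locate(head, k), k);
        for (int k = 11; k >= 0; k--)      /* abort: undo in reverse order */
            undo_insert(remembered[k], k);
        return 0;                          /* pages are leaked; it's a toy */
    }

In this toy run, the keys displaced by later splits each report visiting two pages instead of one; shrink PAGE_CAPACITY relative to the number of inserts and the walks get correspondingly longer, which is the O(N^2) worry in the bulk-load scenario below.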
Suppose, for example, that you bulk-load a bunch of rows into an indexed table in descending order of the indexed column, with all the new values being larger than any existing values in that column. The insertion point basically doesn't change: you're always inserting after what was the original high value in the column, and that point is always on the same page, but that page is going to be repeatedly split, so that, at the end of the load, almost none of the newly-inserted rows are on the page into which they were originally inserted. Now if you abort, you're going to have to either walk right a long way from the original insertion point to find each tuple, or re-find each tuple by traversing from the root of the tree instead of remembering where you left it. Doing the first for N tuples is O(N^2), and doing the second is O(N*H), where H is the height of the btree. The latter is almost like O(N), given the high fanout of a btree, but with a much higher constant factor than the remember-where-you-put-it strategy would have in cases where no splits have occurred. Neither seems very good. This seems to be a very general problem with making undo and indexes work nicely together: almost any index type sometimes has to move tuples around to different pages, which makes re-finding them a lot more expensive than re-finding a heap tuple.

I think most of the above is a bit of a diversion from the original topic of the thread. I think I see the connection you're making between the two topics: the more likely undo application is to fail, the more worrying a hard limit is; deadlocks are one way for undo application to fail; and if deadlocks are likely to be common when undo is applied to indexes, then undo failure will be common and a hard limit is bad. However, I think the solution to that problem is page-at-a-time undo: if a foreground process needs to modify a page with pending undo, and if the modification it wants to make can't be done sensibly unless the undo is applied first, it should be prepared to apply that undo itself - just for that page - rather than wait for somebody else to get it done. That's important not only for deadlock avoidance - though deadlock avoidance is certainly a legitimate concern - but also because the change might be part of some gigantic rollback that's going to take an hour, and waiting for the undo to hit all the other pages before it gets to this one will make users very sad.

Assuming page-at-a-time undo is possible for all undo-using AMs, which I believe to be more or less a requirement for anything production-grade, I don't really see what common deadlock scenario could exist. Either we're talking about LWLocks, in which case we've got a bug in the code, or we're talking about heavyweight locks, in which case we're dealing with a rare scenario where undo work piles up behind strategically-acquired AELs.
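To show what I mean by page-at-a-time undo, here's a minimal sketch - again a toy with invented names (has_pending_undo, apply_pending_undo_for_page) and a stubbed-out transaction-status check, not a design proposal for the actual undo machinery - of a foreground writer applying a page's pending undo itself before making its own change, instead of waiting behind the full rollback:

    /*
     * Toy sketch of page-at-a-time undo (not PostgreSQL code): a writer
     * that finds pending undo on a page applies just that page's share of
     * the rollback itself, then proceeds with its own modification.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ROWS 8

    typedef unsigned int Xid;

    typedef struct Page {
        Xid row_xid[MAX_ROWS];     /* xid that created each row */
        int row_val[MAX_ROWS];
        int nrows;
        bool has_pending_undo;     /* set when an owning xact aborts */
    } Page;

    /* In the real system this would consult transaction status (CLOG). */
    static bool xid_is_aborted(Xid xid) {
        return xid == 42;          /* pretend xid 42 rolled back */
    }

    /* Apply this page's slice of the rollback: drop rows created by
     * aborted transactions.  Done by whoever touches the page first. */
    static void apply_pending_undo_for_page(Page *pg) {
        int keep = 0;
        for (int i = 0; i < pg->nrows; i++) {
            if (!xid_is_aborted(pg->row_xid[i])) {
                pg->row_xid[keep] = pg->row_xid[i];
                pg->row_val[keep] = pg->row_val[i];
                keep++;
            }
        }
        pg->nrows = keep;
        pg->has_pending_undo = false;
    }

    /* A foreground insert: undo first if needed, then make our change.
     * No waiting on a possibly hour-long background rollback.  (No
     * overflow check; the toy never fills the page.) */
    static void foreground_insert(Page *pg, Xid my_xid, int val) {
        if (pg->has_pending_undo)
            apply_pending_undo_for_page(pg);
        pg->row_xid[pg->nrows] = my_xid;
        pg->row_val[pg->nrows] = val;
        pg->nrows++;
    }

    int main(void) {
        Page pg = { .nrows = 0, .has_pending_undo = false };
        foreground_insert(&pg, 42, 100);   /* xact 42 inserts a row ...   */
        pg.has_pending_undo = true;        /* ... and then aborts         */
        foreground_insert(&pg, 43, 200);   /* xact 43 cleans up, inserts  */
        printf("rows on page: %d (value %d)\n", pg.nrows, pg.row_val[0]);
        return 0;
    }

The point is just the control flow in foreground_insert: the cleanup cost is bounded by the one page we already need to touch, no matter how large the aborted transaction was.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company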