On Fri, Jul 19, 2019 at 7:28 PM Peter Geoghegan <p...@bowt.ie> wrote:
> If I'm not mistaken, you're tacitly assuming that you'll always be
> using zheap, or something sufficiently similar to zheap. It'll
> probably never be possible to UNDO changes to something like a GIN
> index on a zheap table, because you can never do that with sensible
> concurrency/deadlock behavior.
I mean, essentially any well-designed framework intended for any sort of task whatsoever is going to have a design center where one can foresee that it will work well, and then, as a result of working well for the thing for which it was designed, it will also work well for other things that are sufficiently similar. So I think you're correct, but I also don't think that's really saying very much. The trick is to figure out whether and how the ideas you have could be generalized with reasonable effort to handle other cases, and that's easier with some projects than others. When it comes to UNDO, I think it's actually really hard. The system has some assumptions built into it which are probably required for good performance and reasonable complexity, and it probably has other assumptions built into it which are unnecessary and could be eliminated if we only realized that we were making them in the first place. The more involvement we get from people who aren't coming at this from the point of view of zheap, the more likely it is that we'll find those assumptions and wipe them out before they get set in concrete. Unfortunately, we haven't had many takers so far -- thanks for chiming in.

I don't really understand your comments about GIN. My simplistic understanding of GIN is that it's not very different from btree in this regard. Suppose we insert a row, and then the insert aborts; suppose also that the index wants to use UNDO. In the case of a btree index, we insert an index entry for the new row; upon abort, we should undo the index insertion by removing that index tuple, or at least marking it dead. Unless a page split has happened, triggered either by the insertion itself or by subsequent activity, this puts the index in a state that is almost perfectly equivalent to where we were before the now-aborted transaction did any work. If a page split has occurred, trying to undo the index insertion runs into two problems. One, we probably can't undo the page split, so the index will be logically but not physically equivalent after we get rid of the new tuple. Two, if the page split happened after the insertion of the new tuple rather than at the same time, the index tuple may not be on the page where we left it. Possibly we can walk right (or left, or sideways, or diagonally at a 35-degree angle; my index-fu is not great here) and be sure of finding it, assuming the index is not corrupt.

Now, my mental model of a GIN index is that you extract N>=0 index keys from each value and do basically the same thing as you would for a btree index for each one of them. Therefore it seems to me, possibly stupidly, that you're going to have basically the same problems, except each problem can now happen up to N times instead of up to 1 time. I assume here that in either case - GIN or btree - you would tentatively record where you left the tuple that now needs to be zapped, so that you can jump to that place directly to try to zap it. Possibly those assumptions are bad, and maybe that's where you're seeing a concurrency/deadlock problem; if so, a more detailed explanation would be very helpful. Based on my more limited knowledge of indexing, I'm not really seeing a concurrency/deadlock issue here, but I do see that there's going to be a horrid efficiency problem if page splits are common.
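To make the re-find problem concrete, here's a minimal, self-contained toy model - emphatically not PostgreSQL code; Page, locate, page_insert, and undo_insert are all made-up names - of a leaf chain where an insert remembers the page it landed on, and an aborting transaction's undo must walk right past any intervening splits to zap the tuple:

    /*
     * Toy model of the re-find problem (not PostgreSQL code).  An insert
     * remembers the page its tuple landed on, but later page splits can
     * move that tuple to a right sibling, so undo may have to walk right.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_CAPACITY 4

    typedef struct Page {
        int keys[PAGE_CAPACITY];
        int nkeys;
        struct Page *right;    /* right-sibling link, as in a btree leaf chain */
    } Page;

    /* Find the leaf that should hold key, starting from the leftmost page. */
    static Page *locate(Page *head, int key) {
        Page *p = head;
        while (p->right != NULL && key >= p->right->keys[0])
            p = p->right;
        return p;
    }

    /* Insert in sorted order, splitting a full page; returns the page the
     * tuple landed on, which is what undo would tentatively remember. */
    static Page *page_insert(Page *p, int key) {
        if (p->nkeys == PAGE_CAPACITY) {
            Page *r = calloc(1, sizeof(Page));
            int half = PAGE_CAPACITY / 2;
            memcpy(r->keys, p->keys + half, (PAGE_CAPACITY - half) * sizeof(int));
            r->nkeys = PAGE_CAPACITY - half;
            p->nkeys = half;
            r->right = p->right;
            p->right = r;
            if (key >= r->keys[0])
                p = r;
        }
        int i = p->nkeys++;
        while (i > 0 && p->keys[i - 1] > key) {
            p->keys[i] = p->keys[i - 1];
            i--;
        }
        p->keys[i] = key;
        return p;
    }

    /* Undo one insert: start at the remembered page and, if splits have
     * moved the key, walk right along the sibling chain to re-find it. */
    static void undo_insert(Page *remembered, int key) {
        int visited = 0;
        for (Page *p = remembered; p != NULL; p = p->right) {
            visited++;
            for (int i = 0; i < p->nkeys; i++) {
                if (p->keys[i] == key) {
                    memmove(&p->keys[i], &p->keys[i + 1],
                            (p->nkeys - i - 1) * sizeof(int));
                    p->nkeys--;
                    printf("undo of key %2d visited %d page(s)\n", key, visited);
                    return;
                }
            }
        }
    }

    int main(void) {
        Page *head = calloc(1, sizeof(Page));
        Page *remembered[12];
        for (int k = 0; k < 12; k++)       /* the "transaction" inserts 12 keys */
            remembered[k] = page_insert(locate(head, k), k);
        for (int k = 11; k >= 0; k--)      /* abort: undo in reverse order */
            undo_insert(remembered[k], k);
        return 0;                          /* pages are leaked; it's a toy */
    }

In this toy run, the keys displaced by later splits each report visiting two pages instead of one; shrink PAGE_CAPACITY relative to the number of inserts and the walks get correspondingly longer, which is the O(N^2) worry in the bulk-load scenario below.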
Suppose, for example, that you bulk-load a bunch of rows into an indexed table in descending order of the indexed column, with all the new values being larger than any existing values in that column. The insertion point basically doesn't change: you're always inserting after what was the original high value in the column, and that point is always on the same page, but that page is going to be repeatedly split, so that, at the end of the load, almost none of the newly-inserted rows are on the page into which they were originally inserted. Now if you abort, you're going to have to either walk right a long way from the original insertion point to find each tuple, or re-find each tuple by traversing from the root of the tree instead of remembering where you left it. Doing the first for N tuples is O(N^2), and doing the second is O(N*H), where H is the height of the btree. The latter is almost like O(N), given the high fanout of a btree, but with a much higher constant factor than the remember-where-you-put-it strategy would have in cases where no splits have occurred. Neither seems very good. This seems to be a very general problem with making undo and indexes work nicely together: almost any index type sometimes has to move tuples around to different pages, which makes re-finding them a lot more expensive than re-finding a heap tuple.

I think most of the above is a bit of a diversion from the original topic of the thread. I think I see the connection you're making between the two topics: the more likely undo application is to fail, the more worrying a hard limit is; deadlocks are one way for undo application to fail; and if deadlocks are likely to be common when undo is applied to indexes, then undo failure will be common and a hard limit is bad. However, I think the solution to that problem is page-at-a-time undo: if a foreground process needs to modify a page with pending undo, and if the modification it wants to make can't be done sensibly unless the undo is applied first, it should be prepared to apply that undo itself - just for that page - rather than wait for somebody else to get it done. That's important not only for deadlock avoidance - though deadlock avoidance is certainly a legitimate concern - but also because the change might be part of some gigantic rollback that's going to take an hour, and waiting for the undo to hit all the other pages before it gets to this one will make users very sad.

Assuming page-at-a-time undo is possible for all undo-using AMs, which I believe to be more or less a requirement for anything production-grade, I don't really see what common deadlock scenario could exist. Either we're talking about LWLocks, in which case we've got a bug in the code, or we're talking about heavyweight locks, in which case we're dealing with a rare scenario where undo work piles up behind strategically-acquired AELs.
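To show what I mean by page-at-a-time undo, here's a minimal sketch - again a toy with invented names (has_pending_undo, apply_pending_undo_for_page) and a stubbed-out transaction-status check, not a design proposal for the actual undo machinery - of a foreground writer applying a page's pending undo itself before making its own change, instead of waiting behind the full rollback:

    /*
     * Toy sketch of page-at-a-time undo (not PostgreSQL code): a writer
     * that finds pending undo on a page applies just that page's share of
     * the rollback itself, then proceeds with its own modification.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ROWS 8

    typedef unsigned int Xid;

    typedef struct Page {
        Xid row_xid[MAX_ROWS];     /* xid that created each row */
        int row_val[MAX_ROWS];
        int nrows;
        bool has_pending_undo;     /* set when an owning xact aborts */
    } Page;

    /* In the real system this would consult transaction status (CLOG). */
    static bool xid_is_aborted(Xid xid) {
        return xid == 42;          /* pretend xid 42 rolled back */
    }

    /* Apply this page's slice of the rollback: drop rows created by
     * aborted transactions.  Done by whoever touches the page first. */
    static void apply_pending_undo_for_page(Page *pg) {
        int keep = 0;
        for (int i = 0; i < pg->nrows; i++) {
            if (!xid_is_aborted(pg->row_xid[i])) {
                pg->row_xid[keep] = pg->row_xid[i];
                pg->row_val[keep] = pg->row_val[i];
                keep++;
            }
        }
        pg->nrows = keep;
        pg->has_pending_undo = false;
    }

    /* A foreground insert: undo first if needed, then make our change.
     * No waiting on a possibly hour-long background rollback.  (No
     * overflow check; the toy never fills the page.) */
    static void foreground_insert(Page *pg, Xid my_xid, int val) {
        if (pg->has_pending_undo)
            apply_pending_undo_for_page(pg);
        pg->row_xid[pg->nrows] = my_xid;
        pg->row_val[pg->nrows] = val;
        pg->nrows++;
    }

    int main(void) {
        Page pg = { .nrows = 0, .has_pending_undo = false };
        foreground_insert(&pg, 42, 100);   /* xact 42 inserts a row ...   */
        pg.has_pending_undo = true;        /* ... and then aborts         */
        foreground_insert(&pg, 43, 200);   /* xact 43 cleans up, inserts  */
        printf("rows on page: %d (value %d)\n", pg.nrows, pg.row_val[0]);
        return 0;
    }

The point is just the control flow in foreground_insert: the cleanup cost is bounded by the one page we already need to touch, no matter how large the aborted transaction was.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company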