abort-time portal cleanup

Robert Haas Thu, 12 Sep 2019 13:43:17 -0700

I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
days and have come to the conclusion that the code is not entirely up
to our usual standards. I believe that a good deal of the reason for
this is attributable to the poor quality of the code comments in this
area, although there are perhaps some other contributing factors as
well, such as bullheadedness on my part and perhaps others.


The trouble starts with the header comment for AtAbort_Portals, which
states that "At this point we run the cleanup hook if present, but we
can't release the portal's memory until the cleanup call." At the time
this logic was introduced in commit
de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
AtAbort_Portals() affected all non-held portals without caring whether
they were active or not, and, UserAbortTransactionBlock() called
AbortTransaction() directly, so typing "ROLLBACK;" would cause
AtAbort_Portals() to be reached from within PortalRun().  Even if
PortalRun() managed to return without crashing, the caller would next
try to call PortalDrop() on what was now an invalid pointer. However,
commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
things so that UserAbortEndTransaction() just sets things up so that
the subsequent call to CommitTransactionCommand() would call
AbortTransaction() instead of trying to do it right away, and that
doesn't happen until after we're done with the portal.  As far as I
can see, that change made this comment mostly false, but the comment
has nevertheless managed to survive for another ~15 years. I think we
can, and in fact should, just drop the portal right here.

As far as actually making that work, there are a few wrinkles. The
first is that we might be in the middle of FATAL. In that case, unlike
the ROLLBACK case, a call to PortalRun() is still on the stack, but
we'll exit the process rather than returning, so the fact that we've
created a dangling pointer for the caller won't matter. However, as
shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
and the report that led up to it at
https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
it's not a good idea to try to clean up the portal in that case,
because we might've already started shutting down critical systems.
It seems not only risky but also unnecessary: our process-local state
is about to go away, and the executor shouldn't need to clean up any
shared memory state that won't also get cleaned up by some other
mechanism. So, it seems to me that if we reach AtAbort_Portals()
during FATAL processing, we should either (1) do nothing at all and
just return or (2) forget about all the existing portals without
cleaning them up and then return.  The second option seems a little
safer to me, because it guarantees that if we somehow reach code that
might try to look up a portal later, it won't find anything. But I
think it's arguable.

The second wrinkle is that there might be an active portal.  Apart
from the FATAL case already mentioned, I think the only way this can
happen is some C code that calls purposefully calls AbortTransaction()
in the middle of executing a command.  It can't be an ERROR, because
then the portal would be marked as failed before we get here, and it
can't be an explicit ROLLBACK, because as noted above, that case was
changed 15 years ago.  It's got to be some other case where C code
calls AbortTransaction() voluntarily in the middle of a statement. For
over a decade, there were no cases at all of this type, but the code
in this function catered to hypothetical cases by marking the portal
failed. By 2016, Noah had figured out that this was bogus, and that
any future cases would likely require different handling, but Tom and
I shouted him down:

http://postgr.es/m/67674.1454259...@sss.pgh.pa.us

The end result of that discussion was commit
41baee7a9312eefb315b6b2973ac058c9efaa9cf (2016-02-05) which left the
code as it was but added comments nothing that it was wrong. It
actually wasn't entirely wrong, because it handled the FATAL case
mentioned above by the byzantine mechanism of invoking the portal's
cleanup callback after first setting the status to PORTAL_FAILED.
Since the only existing cleanup callback arranges to do nothing if the
status is PORTAL_FAILED, this worked out to a complicated way of
(correctly) skipping callback in the FATAL case.

But, probably because that wasn't documented in any understandable
way, possibly because nobody really understood it, when commit
8561e4840c81f7e345be2df170839846814fa004 (2018-01-22) added support
for transaction control in procedures, it just removed the code
marking active portals as failed, just as Noah had predicted would be
necessary ~2 years earlier. Andres noticed that this broke the FATAL
case and tracked it back to the removal of this code, resulting it it
getting put back, but just for the FATAL case, in commit
ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01). See also
discussion at:

https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de

I think that the code here still isn't really right.  Apart from the
fact that the comments don't explain anything very clearly and the
FATAL case is handled in a way that looks overly complicated and
accidental, the handling in AtSubAbort_Portals() hasn't been updated,
so it's now inconsistent with AtAbort_Portals() as to both substance
and comments. I think that's only held together because stored
procedures don't yet support explicit control over subtransactions,
only top-level transactions.

Stepping back a bit, stored procedures are a good example of a command
that uses multiple transactions internally. We have a few others, such
as VACUUM, but at present, that case only commits transactions
internally; it does not roll them back internally.  If it did, it
would want the same thing that procedures want, namely, to leave the
active portal alone. It doesn't quite work to leave the active portal
completely alone, because the portal has got a pointer to a
ResourceOwner which is about to be freed by end-of-transaction
cleanup; at the least, we've got to clear the pointer to that when the
multi-transaction statement eventually finishes in some future
transaction, it doesn't try to do anything with a dangling pointer.
But note that the resource owner can't really be in use in this
situation anyway; if it is, then the transaction abort is about to
blow away resources that the statement still needs. Similarly, the
statement can't be using any transaction-local memory, because that
too is about to get blown away. The only thing that can work at all
here is a statement that's been carefully designed to be able to
survive starting and ending transactions internally. Such a statement
must not rely on any transaction-local resources. The only thing I'm
sure we have to do is set portal->resowner to NULL.  Calling the
cleanup hook, as the current code does, can't be right, because we'd
be cleaning up something that isn't going away. I think it only works
now because this is another case where the cleanup hook arranges to do
nothing in the cases where calling it is wrong in the first place. The
current code also calls PortalReleaseCachedPlan in this case; I'm not
100% certain whether that's appropriate or not.

Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
(2) overhauls the comments in this area, and (3) makes
AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible
concern here - which Andres mentioned to me during off-list discussion
- is that someone might create a portal for a ROLLBACK statement, and
then try to execute that portal after an error has occurred. In
theory, keeping the portal around between abort time and cleanup time
might allow this to work, but it actually doesn't work, so there's no
functional regression. As noted by Amit Kapila also during off-list
discussion, IsTransactionStmtList() only works if portal->stmts is
set, and PortalReleaseCachedPlan() clears portal->stmts, so once we've
aborted the transaction, any previously-created portals are
unrecognizable as transaction statements.

This area is incredibly confusing to understand, so it's entirely
possible that I've gotten some things wrong. However, I think it's
worth the effort to try to do some cleanup here, because I think it's
overly complex and under-documented. On top of the commits already
mentioned, some of which I think demonstrate that the issues here
haven't been completely understood, I found other bug fix commits that
look like bugs that might've never happened in the first place if this
weren't all so confusing.

Feedback appreciated,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-Improve-handling-of-portals-after-sub-transaction-ab.patch
Description: Binary data

abort-time portal cleanup

Reply via email to