Andreas Seltenreich <seltenre...@gmx.de> writes: > Tom Lane writes: >> I wonder if Andreas would be interested in trying the randomly-timed- >> SIGTERM thing with sqlsmith.
> So far, most of the core dumps generated are Jeevan's assertion failing > with backtraces through SearchCatCacheList. The rest is failing this > assertion: > TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: > "portalmem.c", Line: 846) > Example backtrace below. They all happened during a rollback statement. > Testing was done on master at 2336f84284. Interesting. I can reproduce this by forcing a FATAL exit right there, eg by adding if (pstmt->utilityStmt && IsA(pstmt->utilityStmt, TransactionStmt) && ((TransactionStmt *) pstmt->utilityStmt)->kind == TRANS_STMT_ROLLBACK) InterruptPending = ProcDiePending = true; before PortalRunMulti's CHECK_FOR_INTERRUPTS. But it only fails if the rollback is trying to exit a transaction that's already suffered an error. The explanation seems to be: 1. Because we already aborted the transaction, xact.c is in state blockState == TBLOCK_ABORT. 2. This causes AbortOutOfAnyTransaction to think that it can skip doing AbortTransaction() and go straight to CleanupTransaction(). 3. AtCleanup_Portals() is expecting that we already ran, and cleared, the portal cleanup hook (PortalCleanup) for every live portal. But we have not done so for the active portal that we were in the midst of running ROLLBACK in. So it fails the mentioned assertion. Thus, the basic problem is that we get to CleanupTransaction() without having done PortalDrop on the portal we're running ROLLBACK in. We could take this as an argument for simply removing that assertion, which would mean that when AtCleanup_Portals calls PortalDrop, the cleanup hook would get run, the same as it would have if exec_simple_query had had a chance to drop the portal. However, I'm still pretty afraid of allowing arbitrary code to execute during CleanupTransaction(). What seems like a better idea is to make AtCleanup_Portals just summarily clear the cleanup hook, perhaps while emitting a warning. I tried that (see portalmem.c changes in attached patch), and what I got was this: regression=# rollback; WARNING: skipping cleanup for portal "" FATAL: terminating connection due to administrator command FATAL: cannot drop active portal "" server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. or in the server log: 2017-08-13 13:49:50.312 EDT [8737] FATAL: terminating connection due to administrator command 2017-08-13 13:49:50.312 EDT [8737] STATEMENT: rollback; 2017-08-13 13:49:50.312 EDT [8737] WARNING: skipping cleanup for portal "" 2017-08-13 13:49:50.312 EDT [8737] FATAL: cannot drop active portal "" Well, we could tolerate that maybe, but it doesn't seem like it's actually acting nicely; the active portal is still creating problems for us. After some further thought, I decided to try making AbortOutOfAnyTransaction call AtAbort_Portals() in this situation, thus basically doing the minimum part of AbortTransaction() needed to clean up the active portal. That almost worked --- my initial try got an assertion failure in mcxt.c, because somebody was trying to drop the CurrentMemoryContext. So the minimum part of AbortTransaction that we need here is really AtAbort_Memory + AtAbort_Portals. After further thought I decided it'd be a good idea to phrase that as an unconditional AtAbort_Memory at the top of AbortOutOfAnyTransaction, thus making sure we are in some valid context to start with; and then, in case the loop itself doesn't have anything to do, we need AtCleanup_Memory() at the bottom of the function to revert CurrentMemoryContext to TopMemoryContext. In short, the attached patch seems to fix it nicely. We definitely need the xact.c changes, but the ones in portalmem.c are perhaps optional, as in theory now the assertion will never be violated. But I'm inclined to keep the portalmem changes anyway, as that will make it more robust if the situation ever happens in a non-assert build. Note: there are a couple of comments elsewhere in portalmem.c that need adjustment if we keep the portalmem changes. I have not bothered to do that yet, as it's just cosmetic. Comments? regards, tom lane
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 50c3c3b..1eabc67 100644 *** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** AbortOutOfAnyTransaction(void) *** 4233,4238 **** --- 4233,4241 ---- { TransactionState s = CurrentTransactionState; + /* Ensure we're not running in a doomed memory context */ + AtAbort_Memory(); + /* * Get out of any transaction or nested transaction */ *************** AbortOutOfAnyTransaction(void) *** 4274,4280 **** break; case TBLOCK_ABORT: case TBLOCK_ABORT_END: ! /* AbortTransaction already done, still need Cleanup */ CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; break; --- 4277,4290 ---- break; case TBLOCK_ABORT: case TBLOCK_ABORT_END: ! ! /* ! * AbortTransaction is already done, still need Cleanup. ! * However, if we failed partway through running ROLLBACK, ! * there will be an active portal running that command, which ! * we need to shut down before doing CleanupTransaction. ! */ ! AtAbort_Portals(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; break; *************** AbortOutOfAnyTransaction(void) *** 4297,4302 **** --- 4307,4320 ---- case TBLOCK_SUBABORT_END: case TBLOCK_SUBABORT_RESTART: /* As above, but AbortSubTransaction already done */ + if (s->curTransactionOwner) + { + /* As in TBLOCK_ABORT, might have a live portal to zap */ + AtSubAbort_Portals(s->subTransactionId, + s->parent->subTransactionId, + s->curTransactionOwner, + s->parent->curTransactionOwner); + } CleanupSubTransaction(); s = CurrentTransactionState; /* changed by pop */ break; *************** AbortOutOfAnyTransaction(void) *** 4305,4310 **** --- 4323,4331 ---- /* Should be out of all subxacts now */ Assert(s->parent == NULL); + + /* If we didn't actually have anything to do, revert to TopMemoryContext */ + AtCleanup_Memory(); } /* diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index 5983aed..e3a3526 100644 *** a/src/backend/utils/mmgr/portalmem.c --- b/src/backend/utils/mmgr/portalmem.c *************** AtCleanup_Portals(void) *** 842,849 **** if (portal->portalPinned) portal->portalPinned = false; ! /* We had better not be calling any user-defined code here */ ! Assert(portal->cleanup == NULL); /* Zap it. */ PortalDrop(portal, false); --- 842,856 ---- if (portal->portalPinned) portal->portalPinned = false; ! /* ! * We had better not call any user-defined code during cleanup, so if ! * the cleanup hook hasn't been run yet, too bad. ! */ ! if (PointerIsValid(portal->cleanup)) ! { ! elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name); ! portal->cleanup = NULL; ! } /* Zap it. */ PortalDrop(portal, false); *************** AtSubCleanup_Portals(SubTransactionId my *** 1026,1033 **** if (portal->portalPinned) portal->portalPinned = false; ! /* We had better not be calling any user-defined code here */ ! Assert(portal->cleanup == NULL); /* Zap it. */ PortalDrop(portal, false); --- 1033,1047 ---- if (portal->portalPinned) portal->portalPinned = false; ! /* ! * We had better not call any user-defined code during cleanup, so if ! * the cleanup hook hasn't been run yet, too bad. ! */ ! if (PointerIsValid(portal->cleanup)) ! { ! elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name); ! portal->cleanup = NULL; ! } /* Zap it. */ PortalDrop(portal, false);
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers