Justin Pitts <justinpi...@gmail.com> writes: > It went seriously heisenbug since I added %p. It took all day to crash it > since I added that to the configuration. Except this latest crash is > completely different - signal 6 ABRT, different place.
Specifically: 2010-01-13 01:50:42 UTC 5455 ERROR: snapshot reference 0xba6ac0 is not owned by resource owner TopTransaction 2010-01-13 01:50:42 UTC 5455 STATEMENT: COMMIT 2010-01-13 01:50:42 UTC 5455 WARNING: AbortTransaction while in COMMIT state 2010-01-13 01:50:42 UTC 5455 PANIC: cannot abort transaction 777554, it was already committed This looks different, but it matches what I had concluded from the previous dumps: snapmgr.c thinks it has a valid serializable snapshot, but it doesn't. In your previous crashes, CurrentSnaphot was pointing at trashed memory; here, it's still a valid-looking snapshot, but it's not registered in the TopTransactionResourceOwner like it's supposed to be (it wasn't in the other cases either). After puzzling over this for many hours, I have a theory that seems to fit the facts. What I think must be happening is that GetTransactionSnapshot is getting called outside any transaction --- that is, after AtEOXact_Snapshot but before CleanupTransaction (thus, the old TopTransactionResourceOwner is still there, so its attempt to register the snap doesn't dump core). Your build isn't assert-enabled so the extra snapshot registration wouldn't be noticed by the eventual ResourceOwnerDelete. Then, when we next start a transaction and request a serializable snapshot, GetTransactionSnapshot thinks it has nothing to do and happily hands back a pointer to garbage. Now, how the heck did GetTransactionSnapshot get called that way? Your log shows that you're using client software that likes to abort failed transactions with a previously prepared statement: 2010-01-13 01:50:41 UTC 5455 LOG: execute S_2: ROLLBACK This means that to get to CleanupTransaction, we have to look up a cached plan while in aborted transaction state. And that will call RevalidateCachedPlan, which does this if the plan is marked dead: /* * If a snapshot is already set (the normal case), we can just use * that for parsing/planning. But if it isn't, install one. Note: no * point in checking whether parse analysis requires a snapshot; * utility commands don't have invalidatable plans, so we'd not get * here for such a command. */ if (!ActiveSnapshotSet()) { PushActiveSnapshot(GetTransactionSnapshot()); snapshot_set = true; } As the comment indicates, it should not be possible to get here for a ROLLBACK plan. But look what got added in 8.4: ResetPlanCache(), which marks plans dead *unconditionally*. And one of the ways that that can get invoked is via a SI queue overrun event, which triggers a general cache flush and in particular calls PlanCacheSysCallback(). So what is needed to make this scenario happen is a SI queue overrun, which isn't a common event unless your application does a lot of catalog manipulations. I suppose that the reason your reporting query manages to tickle it is that it creates/deletes temp tables. And even then, you'd need an actual failure on the side receiving the overrun notification. In the cases at hand that failure is 2010-01-13 01:50:41 UTC 5455 ERROR: could not serialize access due to concurrent update which doesn't happen real often in your app, but does happen. So this theory explains what I found in the core dumps, which is otherwise dang hard to explain (I spent a lot of time looking for code paths that could bypass AtEOXact_Snapshot, but there aren't any). It explains why the failure is uncommon. It explains why the failures come in bunches (since a SI overrun is broadcast, and would thereby poison all active sessions with "dead" ROLLBACK plans; and the subsequent serialization errors could easily come in bunches too). It explains why we haven't seen similar stuff in testing --- preparing something as trivial as a ROLLBACK is pretty silly so nobody does it. And it explains why we didn't see it before 8.4 --- the bug was in fact introduced here: http://archives.postgresql.org/pgsql-committers/2008-09/msg00084.php So even though I haven't actually reproduced the problem, I have a fair amount of confidence in the explanation. We could perhaps fix this by changing the above-quoted code, but I'm inclined to teach ResetPlanCache that utility statements should never be marked dead. They don't have plans in the normal sense of the word so there's no point in forcing a revalidation cycle. I'm kinda beat right now but will work on a patch tomorrow. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs