"Drouvot, Bertrand" <bdrou...@amazon.com> writes: > Does it make sense (as it is currently) to set the ActiveSnapshot to > NULL and not ensuring the same is done for ActivePortal->portalSnapshot?
I think that patch is just a kluge :-( After tracing through this I've figured out what is happening, and why you need to put the exception block inside a loop to make it happen. The first iteration goes fine, and we end up with an empty ActiveSnapshot stack (and no portalSnapshots) because that's how the COMMIT leaves it. In the second iteration, we try to re-establish the portal snapshot, but at the point where EnsurePortalSnapshotExists is called (from the first INSERT command) we are already inside a subtransaction thanks to the plpgsql exception block. This means that the stacked ActiveSnapshot has as_level = 2, although the Portal owning it belongs to the outer transaction. So at the next exception, AtSubAbort_Snapshot zaps the stacked ActiveSnapshot, but the Portal stays alive and now it has a dangling portalSnapshot pointer. Basically there seem to be two ways to fix this, both requiring API additions to snapmgr.c (hence, cc'ing Alvaro for opinion): 1. Provide a variant of PushActiveSnapshot that allows the caller to specify the as_level to use, and then have EnsurePortalSnapshotExists, as well as other places that create Portal-associated snapshots, use this with as_level equal to the Portal's createSubid. The main hazard I see here is that this could theoretically allow the ActiveSnapshot stack to end up with out-of-order as_level values, causing issues. For the moment we could probably just add assertions to check that the passed as_level is >= next level, or maybe even that this variant is only called with empty ActiveSnapshot stack. 2. Provide a way for AtSubAbort_Portals to detect whether a portalSnapshot pointer points to a snap that's going to be deleted by AtSubAbort_Snapshot, and then just have it clear any portalSnapshots that are about to become dangling. (This'd amount to accepting the possibility that portalSnapshot is of a different subxact level from the portal, and dealing with the situation.) I initially thought #2 would be the way to go, but it turns out to be a bit messy since what we have is a Snapshot pointer not an ActiveSnapshotElt pointer. We'd have to do something like search the ActiveSnapshot stack looking for pointer equality to the caller's Snapshot pointer, which seems fragile --- do we assume as_snap is unique for any other purpose? That being the case, I'm now leaning to #1. Thoughts? regards, tom lane