Awhile back, Alvaro Herrera wrote:
>> Pushed to all affected branches, along with a somewhat lame
>> isolationtester test for the condition (since we've already broken this
>> twice and not noticed for long).
> Buildfarm member okapi just failed this test in 9.4:

okapi has continued to fail that test, not 100% of the time but much
more often than not ... but only in 9.4.  And no other animals have
shown it at all.  So what to make of that?

Noting that okapi uses a pretty old icc version running at a high -O
level, we could dismiss it as probably-a-compiler-bug.  But that theory
doesn't really account for the fact that it sometimes succeeds.

Another theory, noting that 9.5 and later have memory barriers in
S_UNLOCK which 9.4 lacks, is that the reason 9.4 has a problem is lack
of a memory barrier between SnapshotResetXmin and GetCurrentVirtualXIDs,
thus allowing both processes to observe the other's xmin as still
nonzero given the right timing.  This seems like a stretch, because
really the latter function's LWLockAcquire on ProcArrayLock ought to be
enough to serialize things.  But there has to be *something* different
between 9.4 and all the later branches, and the barrier stuff sure looks
like it's in the right neighborhood.

As an investigative measure, I propose that we insert

	Assert(MyPgXact->xmin == InvalidTransactionId);

into 9.4's DefineIndex, just after its InvalidateCatalogSnapshot call.
I don't want to leave that there permanently, because it's not clear to
me that there are no legitimate cases where a backend would have extra
snapshots active during CREATE INDEX CONCURRENTLY.  But we seem to get
through 9.4's regression tests with it, and it would quickly confirm or
deny whether okapi is failing because it somehow has an extra snapshot.

Assuming that that doesn't show anything, I'm inclined to think that
the next step should be to add a pg_memory_barrier() call to
SnapshotResetXmin (again only in the 9.4 branch), and see if that helps.

			regards, tom lane
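
For illustration only, the "both processes observe the other's xmin as
still nonzero" scenario is the classic store-buffering pattern: each side
stores to its own slot and then loads the other's.  The sketch below is
not PostgreSQL code.  It uses two threads rather than two backends, C11's
atomic_thread_fence(memory_order_seq_cst) as a stand-in for
pg_memory_barrier(), and hypothetical names throughout (xmin_a, xmin_b,
count_both_stale).  With the full fence between the store and the load,
C11 guarantees that at least one side sees the other's reset.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for two backends' advertised xmin values. */
static atomic_uint xmin_a;
static atomic_uint xmin_b;
static atomic_int  go;          /* start signal so both threads race */

struct backend_args
{
    atomic_uint *mine;          /* slot this "backend" resets */
    atomic_uint *other;         /* slot this "backend" then inspects */
    unsigned     seen;          /* value observed in the other slot */
};

static void *
backend(void *p)
{
    struct backend_args *args = p;

    /* Wait until both threads exist, to widen the race window. */
    while (atomic_load_explicit(&go, memory_order_acquire) == 0)
        ;

    /* Analogous to SnapshotResetXmin clearing our advertised xmin. */
    atomic_store_explicit(args->mine, 0, memory_order_relaxed);

    /* Full barrier: stand-in for the proposed pg_memory_barrier(). */
    atomic_thread_fence(memory_order_seq_cst);

    /* Analogous to scanning the other backend's xmin afterwards. */
    args->seen = atomic_load_explicit(args->other, memory_order_relaxed);
    return NULL;
}

/* Count runs in which BOTH sides still saw the other's old nonzero xmin. */
int
count_both_stale(int iterations)
{
    int both_stale = 0;

    for (int i = 0; i < iterations; i++)
    {
        struct backend_args a = {&xmin_a, &xmin_b, 0};
        struct backend_args b = {&xmin_b, &xmin_a, 0};
        pthread_t ta, tb;
        int rc;

        atomic_store(&xmin_a, 100);     /* arbitrary nonzero "old" xmin */
        atomic_store(&xmin_b, 100);
        atomic_store(&go, 0);

        rc = pthread_create(&ta, NULL, backend, &a);
        assert(rc == 0);
        rc = pthread_create(&tb, NULL, backend, &b);
        assert(rc == 0);
        (void) rc;

        atomic_store_explicit(&go, 1, memory_order_release);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        if (a.seen != 0 && b.seen != 0)
            both_stale++;
    }
    return both_stale;
}
```

Build with -pthread and call count_both_stale() from a small driver.
Removing the fence makes the both-stale outcome permitted by the memory
model, though whether it actually shows up depends on hardware, compiler
and timing, which would at least be consistent with a failure that is
frequent but not 100% reproducible.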