On 29/11/17 20:11, Stas Kelvich wrote: > >> On 29 Nov 2017, at 18:46, Petr Jelinek <petr.jeli...@2ndquadrant.com> wrote: >> >> What I don't understand is how it leads to crash (and I could not >> reproduce it using the pgbench file attached in this thread either) and >> moreover how it leads to 0 xid being logged. The only explanation I can >> come up is that some kind of similar race has to be in >> LogStandbySnapshot() but we explicitly check for 0 xid value there. >> > > Zero xid isn’t logged. Loop in XactLockTableWait() does following: > > for (;;) > { > Assert(TransactionIdIsValid(xid)); > Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny())); > > <...> > > xid = SubTransGetParent(xid); > } > > So if last statement is reached for top transaction then next iteration > will crash in first assert. And it will be reached if whole this loop > happens before transaction acquired heavyweight lock. > > Probability of that crash can be significantly increased be adding sleep > between xid generation and lock insertion in AssignTransactionId(). >
Yes that helps thanks. Now that I reproduced it I understand, I was confused by the backtrace that said xid was 0 on the input to XactLockTableWait() but that's not the case, it's what xid is changed to in the inner loop. So what happens is that we manage to do LogStandbySnapshot(), decode the logged snapshot, and run SnapBuildWaitSnapshot() for a transaction in between GetNewTransactionId() and XactLockTableInsert() calls in AssignTransactionId() for that same transaction. I guess the probability of this happening is increased by the fact that GetRunningTransactionData() acquires XidGenLock so if there is GetNewTransactionId() running in parallel it will wait for it to finish and we'll log immediately after that. Hmm that means that Robert's loop idea will not help and ProcArrayLock will not save us either. Maybe we could either rewrite XactLockTableWait or create another version of it with SubTransGetParent() call replaced by SubTransGetTopmostTransaction() as that will return the same top level xid in case the input xid wasn't a subxact. That would make it safe to be called on transactions that didn't acquire lock on themselves yet. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services