Re: [HACKERS] Issues with logical replication

Petr Jelinek Wed, 29 Nov 2017 15:26:41 -0800

On 29/11/17 20:11, Stas Kelvich wrote:
> 
>> On 29 Nov 2017, at 18:46, Petr Jelinek <petr.jeli...@2ndquadrant.com> wrote:
>>
>> What I don't understand is how it leads to crash (and I could not
>> reproduce it using the pgbench file attached in this thread either) and
>> moreover how it leads to 0 xid being logged. The only explanation I can
>> come up is that some kind of similar race has to be in
>> LogStandbySnapshot() but we explicitly check for 0 xid value there.
>>
> 
> Zero xid isn’t logged. Loop in XactLockTableWait() does following:
> 
>     for (;;)
>     {
>         Assert(TransactionIdIsValid(xid));
>         Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny()));
> 
>       <...>
>       
>         xid = SubTransGetParent(xid);
>     }
> 
> So if last statement is reached for top transaction then next iteration
> will crash in first assert. And it will be reached if whole this loop
> happens before transaction acquired heavyweight lock.
> 
> Probability of that crash can be significantly increased be adding sleep
> between xid generation and lock insertion in AssignTransactionId().
>


Yes that helps thanks. Now that I reproduced it I understand, I was
confused by the backtrace that said xid was 0 on the input to
XactLockTableWait() but that's not the case, it's what xid is changed to
in the inner loop.

So what happens is that we manage to do LogStandbySnapshot(), decode the
logged snapshot, and run SnapBuildWaitSnapshot() for a transaction in
between GetNewTransactionId() and XactLockTableInsert() calls in
AssignTransactionId() for that same transaction.

I guess the probability of this happening is increased by the fact that
GetRunningTransactionData() acquires XidGenLock so if there is
GetNewTransactionId() running in parallel it will wait for it to finish
and we'll log immediately after that.

Hmm that means that Robert's loop idea will not help and ProcArrayLock
will not save us either. Maybe we could either rewrite XactLockTableWait
or create another version of it with SubTransGetParent() call replaced
by SubTransGetTopmostTransaction() as that will return the same top
level xid in case the input xid wasn't a subxact. That would make it
safe to be called on transactions that didn't acquire lock on themselves
yet.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Issues with logical replication

Reply via email to