Re: Problem with synchronous replication

2019-10-30 Thread lingce . ldm
On Oct 29, 2019, at 18:50, Kyotaro Horiguchi mailto:horikyota@gmail.com>> wrote:
> 
> Hello.
> 
> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" 
> mailto:lingce@alibaba-inc.com>> wrote in 
>> 
>> Hi,
>> 
>> I recently discovered two possible bugs about synchronous replication.
>> 
>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>> SyncRepCleanupAtProcExit first checks whether the queue is detached, if it 
>> is not detached, 
>> acquires the SyncRepLock lock and deletes it. If this element has been 
>> deleted by walsender, 
>> it will be deleted repeatedly, SHMQueueDelete will core with a segment 
>> fault. 
>> 
>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then 
>> check
>> whether the queue is detached or not.
> 
> I think you're right here.

Thanks.

> 
>> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport check one 
>> interrupt.
>> For SyncRepWaitForLSN, if a query cancel interrupt arrives, we just 
>> terminate the wait 
>> with suitable warning. As follows:
>> 
>> a. set QueryCancelPending to false
>> b. errport outputs one warning
>> c. calls SyncRepCancelWait to delete one element from the queue
>> 
>> If another cancel interrupt arrives when we are outputting warning at step 
>> b, the errfinish
>> will call CHECK_FOR_INTERRUPTS that will output an ERROR, such as "canceling 
>> autovacuum
>> task", then the process will jump to the sigsetjmp. Unfortunately, the step 
>> c will be skipped
>> and the element that should be deleted by SyncRepCancelWait is remained.
>> 
>> The easiest way to fix this is to swap the order of step b and step c. On 
>> the other hand, 
>> let sigsetjmp clean up the queue may also be a good choice. What do you 
>> think?
>> 
>> Attached the patch, any feedback is greatly appreciated.
> 
> This is not right. It is in transaction commit so it is in a
> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
> cancel/die interrupts thus the ereport should return.

I read the relevant code, you are right. I called SyncRepWaitForLSN somewhere 
else, 
but forgot to put it in a HOLD_INTERRUPTS and  triggered an exception.

regards.

—
Dongming Liu

smime.p7s
Description: S/MIME cryptographic signature


Re: Problem with synchronous replication

2019-10-30 Thread lingce . ldm
On Oct 30, 2019, at 09:45, Michael Paquier mailto:mich...@paquier.xyz>> wrote:
> 
> On Tue, Oct 29, 2019 at 07:50:01PM +0900, Kyotaro Horiguchi wrote:
>> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" 
>>  wrote in 
>>> I recently discovered two possible bugs about synchronous replication.
>>> 
>>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>>> SyncRepCleanupAtProcExit first checks whether the queue is detached, if it 
>>> is not detached, 
>>> acquires the SyncRepLock lock and deletes it. If this element has been 
>>> deleted by walsender, 
>>> it will be deleted repeatedly, SHMQueueDelete will core with a segment 
>>> fault. 
>>> 
>>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then 
>>> check
>>> whether the queue is detached or not.
>> 
>> I think you're right here.
> 
> Indeed.  Looking at the surroundings we expect some code paths to hold
> SyncRepLock in exclusive or shared mode but we don't actually check
> that the lock is hold.  So let's add some assertions while on it.
> 
>> This is not right. It is in transaction commit so it is in a
>> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
>> cancel/die interrupts thus the ereport should return.
> 
> Yeah.  There is an easy way to check after that: InterruptHoldoffCount
> needs to be strictly positive.
> 
> My suggestions are attached.  Any thoughts?

Thanks, this patch looks good to me.

smime.p7s
Description: S/MIME cryptographic signature