The proposed fix looks good, it resolves the lock contention problem as 
intended. +1 from my side.

> On 09-Jul-2020, at 4:22 PM, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> 
> 
> Regarding how to fix, don't we need memory barrier when reading
> sync_standbys_defined? Without that, after SyncRepUpdateSyncStandbysDefined()
> updates it to true, SyncRepWaitForLSN() can see the previous value,
> i.e., false, and then exit out of the function. Is this right?
> If this is right, we need memory barrier to avoid this issue?
> 

There is no out-of-order execution hazard in the scenario you are describing, 
memory barriers don’t seem to fit.  Using locks to synchronise checkpointer 
process and a committing backend process is the right way.  We have made a 
conscious decision to bypass the lock, which looks correct in this case.

As an aside, there is a small (?) window where a change to 
synchronous_standby_names GUC is partially propagated among committing 
backends, checkpointer and walsender.  Such a window may result in walsender 
declaring a standby as synchronous while a commit backend fails to wait for it 
in SyncRepWaitForLSN.  The root cause is walsender uses sync_standby_priority, 
a per-walsender variable to tell if a standby is synchronous.  It is updated 
when walsender processes a config change.  Whereas sync_standbys_defined, a 
variable updated by checkpointer, is used by committing backends to determine 
if they need to wait.  If checkpointer is busy flushing buffers, it may take 
longer than walsender to reflect a change in sync_standbys_defined.  This is a 
low impact problem, should be ok to live with it.

Asim

Reply via email to