At Fri, 7 Jan 2022 09:44:15 -0800, SATYANARAYANA NARLAPURAM <satyanarlapu...@gmail.com> wrote in
> On Fri, Jan 7, 2022 at 12:27 AM Kyotaro Horiguchi <horikyota....@gmail.com>
> wrote:
> > One is to serialize WAL sending (of course it is unacceptable at all)
> > or another is to send WAL to all standbys at once then make the
> > decision after making sure receiving replies from all standbys (this
> > is no longer quorum commit in another sense..)
> >
>
> There is no need to serialize sending the WAL among sync standbys. The only
> serialization required is first to all the sync replicas and then to sync
> replicas if any. Once an LSN is quorum committed, no failover subsystem
> initiates an automatic failover such that the LSN is lost (data loss)
Sync standbys in PostgreSQL are determined ex post facto. When a certain set of standbys has first reported catching up for a commit, they are called "sync standbys". We could maintain a fixed set of sync standbys based on the set of sync standbys at past commits, but that implies performance degradation even if not a single standby is gone. If we send WAL only to the fixed set of sync standbys, then when any of those standbys is gone, the primary is forced to wait until some timeout expires. The same commit would finish immediately if WAL had also been sent to the out-of-quorum standbys.

> > So I'm afraid that there's no sensible solution to avoid the
> > hiding-forerunner problem on quorum commit.
>
> Could you elaborate on the problem here?

If the primary has received responses for LSN=X from N standbys, that fact doesn't guarantee that none of the other standbys has reached the same LSN. If one of the standbys that has not yet responded has already reached LSN=X+10, but its response has not arrived at the primary for some reason, the true fastest standby is hidden from the primary. Even if the primary examines the responses from all standbys, it cannot be sure that those responses reflect the truly current state of the standbys. Thus, if we want to guarantee that no unresponsive standby goes beyond LSN=X, there is no means other than refraining from sending WAL beyond X. In that case, we need to serialize the whole period from WAL sending to response reception, which would lead to critical performance degradation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
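A minimal sketch (not PostgreSQL's code) of the hiding-forerunner scenario described above. It assumes ANY n quorum semantics in which the commit point is the n-th highest flush LSN the primary has *heard about*, so the "sync standbys" are simply whichever n replies arrived first. The names `Standby`, `quorum_lsn` and `hidden_forerunners`, and the LSN values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Standby:
    name: str
    flushed_lsn: int             # how far this standby has actually flushed WAL
    reported_lsn: Optional[int]  # last flush LSN the primary has heard (None = no reply yet)


def quorum_lsn(standbys: List[Standby], n: int) -> Optional[int]:
    """Highest LSN that at least n standbys have *reported* (assumed ANY n semantics).

    Which standbys count as "sync" is decided after the fact: whichever
    n replies happen to be the most advanced ones the primary has seen.
    """
    reported = sorted(
        (s.reported_lsn for s in standbys if s.reported_lsn is not None),
        reverse=True,
    )
    if len(reported) < n:
        return None  # not enough replies yet: the commit keeps waiting
    return reported[n - 1]


def hidden_forerunners(standbys: List[Standby], lsn: int) -> List[str]:
    """Standbys that are really beyond `lsn` while the primary cannot know it."""
    return [
        s.name
        for s in standbys
        if s.flushed_lsn > lsn and (s.reported_lsn is None or s.reported_lsn <= lsn)
    ]


if __name__ == "__main__":
    X = 100  # hypothetical LSN
    standbys = [
        Standby("s1", flushed_lsn=X, reported_lsn=X),
        Standby("s2", flushed_lsn=X, reported_lsn=X),
        # s3 is already at X+10, but its reply has not reached the primary yet.
        Standby("s3", flushed_lsn=X + 10, reported_lsn=None),
    ]
    committed = quorum_lsn(standbys, n=2)
    print(committed)                                # 100
    print(hidden_forerunners(standbys, committed))  # ['s3']
```

Inspecting every reply the primary has does not help here: s3 stays invisible until its own reply arrives. The only way to rule such a standby out is to stop sending WAL beyond X until the quorum replies are in, which is exactly the serialization that causes the performance degradation.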