On Wed, Mar 9, 2022 at 7:31 AM Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-06 12:27:52 +0530, Bharath Rupireddy wrote:
> > On Sun, Mar 6, 2022 at 1:57 AM Andres Freund <and...@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
> > > > I understand. Even if we use the SyncRepWaitForLSN approach, the async
> > > > walsenders will have to do nothing in WalSndLoop() until the sync
> > > > walsender wakes them up via SyncRepWakeQueue.
> > >
> > > I still think we should flat out reject this approach. The proper way to
> > > implement this feature is to change the protocol so that WAL can be sent
> > > to replicas with an additional LSN informing them up to where WAL can be
> > > flushed. That way WAL is already sent when the sync replicas have
> > > acknowledged receipt and just an updated "flush/apply up to here" LSN has
> > > to be sent.
> >
> > I was having this thought back of my mind. Please help me understand these:
> > 1) How will the async standbys ignore the WAL received but
> > not-yet-flushed by them in case the sync standbys don't acknowledge
> > flush LSN back to the primary for whatever reasons?
>
> What do you mean with "ignore"? When replaying?
Let me illustrate with an example:

1) Say the primary is at LSN 100, the sync standby at LSN 90 (about to
receive, or receiving, the WAL from LSN 91 - 100 from the primary), and the
async standby at LSN 100 - today this is possible if the async standby is
closer to the primary than the sync standby for whatever reason.

2) With the approach originally proposed in this thread, async standbys can
never get ahead of LSN 90 (the flush LSN reported back to the primary by all
sync standbys).

3) With the suggested approach, i.e. "let async standbys receive WAL at their
own pace, but only allow them to apply/write/flush it to the WAL file in the
pg_wal directory/disk up to the sync standbys' latest flush LSN", async
standbys can receive the WAL from LSN 91 - 100 but aren't allowed to
apply/write/flush it.

Where will the async standbys hold the WAL from LSN 91 - 100 until the latest
flush LSN (100) is reported to them? If they "somehow" store the WAL from LSN
91 - 100 without applying/writing/flushing it, how will they ignore that WAL
if, say, the sync standbys don't report the latest flush LSN back to the
primary (for whatever reason)? In such cases the primary has no idea of the
sync standbys' latest flush LSN, if the sync standbys never come back up,
reconnect and resync with the primary. Should the async standbys always
assume that the WAL from LSN 91 - 100 is invalid for them as long as they
haven't received the sync flush LSN from the primary? In that case, aren't
there "invalid holes" in the WAL files on the async standbys?

> I think this'd require adding a new pg_control field saying up to which LSN
> WAL is "valid". If that field is set, replay would only replay up to that LSN
> unless some explicit operation is taken to replay further (e.g. for data
> recovery).

With the suggested approach, i.e. "let async standbys receive WAL at their
own pace, but only allow them to apply/write/flush it to the WAL file in the
pg_wal directory/disk up to the sync standbys' latest flush LSN", there can
be two parts to the WAL on async standbys - most of it "valid and makes sense
for async standbys" and some of it "invalid and doesn't make sense for async
standbys". Wouldn't this require reworking parts like the redo/apply/recovery
logic on async standbys, and tools such as pg_basebackup, pg_rewind,
pg_receivewal, pg_recvlogical, cascading replication etc. that depend on WAL
records and would now need to know whether those WAL records are valid for
them? I may be wrong here though.

> > 2) When we say the async standbys will receive the WAL, will they just
> > keep the received WAL in the shared memory but not apply or will they
> > just write but not apply the WAL and flush the WAL to the pg_wal
> > directory on the disk or will they write to some other temp wal
> > directory until they receive go-ahead LSN from the primary?
>
> I was thinking that for now it'd go to disk, but eventually would first go to
> wal_buffers and only to disk if wal_buffers needs to be flushed out (and only
> in that case the pg_control field would need to be set).

IIUC, the WAL buffers (XLogCtl->pages) aren't used on standbys, as
walreceivers bypass them and flush the received data directly to disk. Hence,
the WAL buffers that are allocated (I haven't checked the code, though) but
unused on standbys could be used to hold the WAL until the new flush LSN is
reported by the primary. At any point of time, the WAL buffers would then
hold the latest WAL that's waiting for a new flush LSN from the primary.
However, this can be a problem for larger transactions that eat up the entire
WAL buffers while the flush LSN is still far behind; in that case we'd need
to flush the WAL to the latest WAL file in pg_wal on disk, but let the rest
of the server know up to which LSN that WAL is valid. A rough sketch of what
this could look like on the async standby side is below.
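Something along these lines - plain C pseudocode only, not actual PostgreSQL
code; all of the function and variable names here (async_standby_receive(),
hold_back_wal(), release_held_wal_upto(), syncConfirmedLSN) are made up just
to illustrate the idea:

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's LSN type */

/* Hypothetical helpers; where the held-back WAL lives is the open question. */
static void write_and_flush_wal(const char *buf, XLogRecPtr start, XLogRecPtr end);
static void hold_back_wal(const char *buf, XLogRecPtr start, XLogRecPtr end);
static void release_held_wal_upto(XLogRecPtr upto);

/* Latest "you may write/flush/apply up to here" LSN reported by the primary. */
static XLogRecPtr syncConfirmedLSN = 0;

/* Called when a WAL chunk covering [startptr, endptr) arrives on the async standby. */
static void
async_standby_receive(const char *buf, XLogRecPtr startptr, XLogRecPtr endptr)
{
    if (endptr <= syncConfirmedLSN)
    {
        /* Whole chunk is already confirmed by the sync standbys: behave as today. */
        write_and_flush_wal(buf, startptr, endptr);
        return;
    }

    if (startptr < syncConfirmedLSN)
    {
        /* The leading part is confirmed; write/flush it right away. */
        write_and_flush_wal(buf, startptr, syncConfirmedLSN);
        buf += (syncConfirmedLSN - startptr);
        startptr = syncConfirmedLSN;
    }

    /*
     * Keep the unconfirmed tail in memory (e.g. the otherwise unused WAL
     * buffers).  It is written/flushed only once a newer confirmed LSN
     * arrives, and can simply be dropped if the primary never confirms it.
     */
    hold_back_wal(buf, startptr, endptr);
}

/* Called when the primary sends an updated confirmed flush LSN. */
static void
async_standby_advance_confirmed_lsn(XLogRecPtr newConfirmedLSN)
{
    if (newConfirmedLSN > syncConfirmedLSN)
    {
        syncConfirmedLSN = newConfirmedLSN;
        release_held_wal_upto(syncConfirmedLSN);
    }
}

In this sketch, when the held-back WAL no longer fits in memory,
hold_back_wal() would have to spill it to pg_wal and, at that point, record
the "valid up to" boundary (e.g. in the new pg_control field mentioned
above), so that recovery and other WAL consumers know not to trust anything
beyond it.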
> > 3) Won't the network transfer cost be wasted in case the sync standbys
> > don't acknowledge flush LSN back to the primary for whatever reasons?
>
> That should be *extremely* rare, and in that case a bit of wasted traffic
> isn't going to matter.

Agree.

> > The proposed idea in this thread (async standbys waiting for flush LSN
> > from sync standbys before sending the WAL), although it makes async
> > standby slower in receiving the WAL, it doesn't have the above
> > problems and is simpler to implement IMO. Since this feature is going
> > to be optional with a GUC, users can enable it based on the needs.
>
> To me it's architecturally the completely wrong direction. We should move in
> the *other* direction, i.e. allow WAL to be sent to standbys before the
> primary has finished flushing it locally. Which requires similar
> infrastructure to what we're discussing here.

Agree. There's also this existing comment in the walsender code:

    * XXX probably this should be improved to suck data directly from the
    * WAL buffers when possible.

Like others pointed out, if that is done, it's possible to achieve "allow WAL
to be sent to standbys before the primary has finished flushing it locally".

I would like to hear more thoughts and then summarize the design points a bit
later.

Regards,
Bharath Rupireddy.