Hi,

Thanks for looking into this.
On Fri, Aug 23, 2024 at 5:03 AM John H <johnh...@gmail.com> wrote:
>
> For a motivation aspect I can see this being useful
> synchronous_replicas if you have commit set to flush mode.

In a synchronous replication setup, commits on the primary have to
wait until the standby finishes fetching WAL from the archive, which
can increase query latency. If the standby can reconnect to the
primary as soon as the broken connection is restored, it can fetch the
WAL sooner and transaction commits can continue on the primary. Is my
understanding correct? Is there anything more to this?

I talked to Michael Paquier at PGConf.Dev 2024, and he had some
concerns about how this feature deals with changing timelines; I can't
recall the details right now. There were also some cautions raised
upthread:
https://www.postgresql.org/message-id/20240305020452.GA3373526%40nathanxps13
and
https://www.postgresql.org/message-id/ZffaQt7UbM2Q9kYh%40paquier.xyz.

> So +1 on feature, easier configurability, although thinking about it
> more you could probably have the restore script be smarter and provide
> non-zero exit codes periodically.

Interesting. Yes, the restore script would have to be smarter: it must
detect broken connections and distinguish whether the server is
performing plain archive recovery/PITR or is a standby streaming from
the primary. Getting that wrong could perhaps cause data loss. (A
rough sketch of such a script is at the end of this mail.)

> The patch needs to be rebased but I tested this against an older 17 build.

Will rebase soon.

> > + ereport(DEBUG1,
> > + errmsg_internal("switched WAL source from %s to %s after %s",
> > + xlogSourceNames[oldSource],
>
> Not sure if you're intentionally changing to DEBUG1 from DEBUG2.

Will change.

> > * standby and increase the replication lag on primary.
>
> Do you mean "increase replication lag on standby"?
>
> nit: reading from archive *could* be faster since in theory it's not
> single-processed/threaded.

Yes. I think we can just say "All of these can impact the recovery
performance on standby and increase the replication lag."

> > However,
> > + * exhaust all the WAL present in pg_wal before switching. If successful,
> > + * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
> > + * back to XLOG_FROM_ARCHIVE state.
>
> I think I'm missing how this happens. Or what "successful" means. If
> I'm reading it right, no matter what happens we will always move to
> XLOG_FROM_STREAM based on how the state machine works?

Please have a look at some discussion upthread on exhausting pg_wal
before switching:
https://www.postgresql.org/message-id/20230119005014.GA3838170%40nathanxps13.
Even today, the standby exhausts pg_wal before switching from the
archive to streaming.

> I tested this in a basic RR setup without replication slots (e.g. log
> shipping) where the WAL is available in the archive but the primary
> always has the WAL rotated out and
> 'streaming_replication_retry_interval = 1'. This leads the RR to
> become stuck where it stops fetching from archive and loops between
> XLOG_FROM_PG_WAL and XLOG_FROM_STREAM.

Nice catch. This is a problem. One idea is to disable the
streaming_replication_retry_interval feature for slot-less streaming
replication: when primary_slot_name isn't specified, either disallow
the GUC from being set in its assign_hook, or skip the switch when
deciding on the WAL source. Thoughts? (A rough recipe I plan to use to
reproduce this is also at the end of this mail.)
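For the "smarter restore script" idea, here's a rough sketch of the
kind of wrapper I had in mind. The archive path and primary host name
are made up, and it leans on the fact that in standby mode a non-zero
exit from restore_command just means "segment not available", after
which the startup process moves on and retries streaming:

    #!/bin/sh
    # smarter_restore.sh -- hypothetical restore_command wrapper, used as:
    #   restore_command = '/path/to/smarter_restore.sh %f %p'
    WALFILE="$1"
    TARGET="$2"
    ARCHIVE="/mnt/server/archive"    # assumed shared archive location
    PRIMARY="primary.example.com"    # assumed primary host

    # If the primary is reachable again, fail the archive fetch so that
    # the standby gives streaming another try instead of replaying from
    # the archive indefinitely.
    if pg_isready -q -h "$PRIMARY" -p 5432; then
        exit 1
    fi

    # Otherwise serve the requested segment from the archive.
    test -f "$ARCHIVE/$WALFILE" || exit 1
    exec cp "$ARCHIVE/$WALFILE" "$TARGET"

Of course, a real script would need to be more careful - for instance,
it must not fail like this during plain archive recovery/PITR, which
is exactly the "distinguish the modes" problem mentioned above.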
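And here's roughly the recipe I plan to use to reproduce the stuck
state you saw - a slot-less standby restoring from the archive with
the new GUC set; all paths and connection strings are illustrative:

    # On the standby (note: no primary_slot_name):
    echo "primary_conninfo = 'host=primary.example.com port=5432 user=repl'" >> "$PGDATA/postgresql.conf"
    echo "restore_command = 'cp /mnt/server/archive/%f %p'" >> "$PGDATA/postgresql.conf"
    echo "streaming_replication_retry_interval = 1" >> "$PGDATA/postgresql.conf"
    touch "$PGDATA/standby.signal"
    pg_ctl -D "$PGDATA" start

    # On the primary (wal_keep_size = 0), switch and checkpoint a few
    # times so the segments the standby needs are recycled out of
    # pg_wal but still present in the archive:
    psql -c "SELECT pg_switch_wal();" -c "CHECKPOINT;"

    # The standby should then stop fetching from the archive and loop
    # between XLOG_FROM_PG_WAL and XLOG_FROM_STREAM.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com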