On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian <br...@momjian.us> wrote:
>
> As I highlighted above, by default you notify the administrator that a
> synchronous replica is not responding and then ignore it. If it becomes
> responsive again, you notify the administrator again and add it back as
> a synchronous replica.
>
> > command in any form may pose security risks. I'm not sure at this
> > point how this new timeout is going to work alongside
> > wal_sender_timeout.
>
> We have archive_command, so I don't see a problem with another shell
> command.

Why do we need a new command to inform the admin/user that a sync
standby is being ignored (dropped from the sync quorum) for not
responding or acknowledging for a certain amount of time in
SyncRepWaitForLSN()? Can't we just add an extra column to
pg_stat_replication, or use its existing sync_state column? We can
either introduce a new state such as temporary_async or just use the
existing state 'potential' [1]. The drawback is that the server has to
be monitored for this extra, new state. If we do this, we don't need
another command to report.

> > I'm thinking about the possible options that an admin has to get out
> > of this situation:
> > 1) Removing the standby from synchronous_standby_names.
>
> Yes, see above. We might need a read-only GUC that reports which
> synchronous replicas are active. As you can see, there is a lot of API
> design required here, but this is the most effective approach.

If we use the above approach to report via pg_stat_replication, we
don't need this.

> > > Once we have that, we can consider removing the cancel ability while
> > > waiting for synchronous replicas (since we have the timeout) or make it
> > > optional. We can also consider how to notify the administrator during
> > > query cancel (if we allow it), backend abrupt exit/crash, and
> >
> > Yeah. If we have the
> > timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> > the users can then choose to disable processing query cancels/proc
> > dies while waiting for sync replication in SyncRepWaitForLSN().
>
> Yes. We might also change things so a query cancel that happens during
> synchronous replica waiting can only be done by an administrator, not the
> session owner. Again, lots of design needed here.

Yes, we need infrastructure to track who issued the query cancel or
proc die and so on. IMO, it's not a good idea to allow/disallow query
cancels or CTRL+C based on role types - superusers, users with
replication roles, or users who are members of any of the predefined
roles.
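
To make the pg_stat_replication reporting idea above a bit more
concrete, here is a minimal sketch of what a monitoring job might do
with the view's output. It assumes the temporary_async state proposed
in this mail (PostgreSQL does not have such a state today), and the
function name and sample rows are made up for illustration. It works
on plain (application_name, sync_state) pairs so it runs without a
server.

```python
# Hypothetical state name floated in this thread; PostgreSQL does not define it.
TEMPORARY_ASYNC = "temporary_async"


def demoted_standbys(rows, expected_sync_names, demoted_state=TEMPORARY_ASYNC):
    """Return the standbys the admin expects to be sync but that report the demoted state.

    rows: iterable of (application_name, sync_state) pairs, e.g. the result of
          SELECT application_name, sync_state FROM pg_stat_replication;
    expected_sync_names: standby names listed in synchronous_standby_names.
    """
    expected = set(expected_sync_names)
    return sorted(
        name for name, state in rows
        if name in expected and state == demoted_state
    )


# Made-up sample data standing in for a pg_stat_replication snapshot.
rows = [
    ("standby1", "sync"),
    ("standby2", "temporary_async"),
    ("standby3", "async"),
]
print(demoted_standbys(rows, ["standby1", "standby2"]))  # ['standby2']
```

If the existing 'potential' state were reused instead, the same check
could be called with demoted_state="potential", but it could then no
longer tell a timed-out standby apart from one that is simply lower
priority in synchronous_standby_names.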

In general, it is the walsender serving the sync standby that has to
mark itself as an async standby by removing itself from
synchronous_standby_names, reloading the config variables, and waking
up the backends that are waiting in the syncrep wait queue for it to
update the LSN. Also, the new auto-removal timeout should always be set
to less than wal_sender_timeout.

All that said, imagine we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
in one form or another, with the auto-removal timeout set to 5 minutes.
Either of the following can happen:

1) A query is stuck waiting for the sync standby ack in
SyncRepWaitForLSN(), no query cancel or proc die interrupt arrives, and
the sync standby is made an async standby after the timeout, i.e. 5
minutes.

2) A query is stuck waiting for the sync standby ack in
SyncRepWaitForLSN(), say for about 3 minutes, and then a query cancel
or proc die interrupt arrives. Should we process it immediately, or
wait for the timeout to happen (2 more minutes) and then process the
interrupt? If we process the interrupt immediately, then the
locally-committed-but-not-replicated-to-sync-standby problems
described upthread [2] are left unresolved.

[1] https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-VIEW
sync_state text
Synchronous state of this standby server. Possible values are:
async: This standby server is asynchronous.
potential: This standby server is now asynchronous, but can
potentially become synchronous if one of current synchronous ones
fails.
sync: This standby server is synchronous.
quorum: This standby server is considered as a candidate for quorum
standbys.

[2] https://www.postgresql.org/message-id/CALj2ACXmMWtpmuT-%3Dv8F%2BLk4QCbdkeN%2ByHKXeRGKFfjG96YbKA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com