Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

Bharath Rupireddy Thu, 04 Aug 2022 01:32:10 -0700

On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4...@yandex-team.ru> wrote:
> >
> > > On 25 July 2022, at 14:29, Bharath Rupireddy 
> > > <bharath.rupireddyforpostg...@gmail.com> wrote:
> > >
> > > Hm, after thinking for a while, I tend to agree with the above
> > > approach - meaning, query cancel interrupt processing can completely
> > > be disabled in SyncRepWaitForLSN() and process proc die interrupt
> > > immediately, this approach requires no GUC as opposed to the proposed
> > > v1 patch upthread.
> > GUC was proposed here[0] to maintain compatibility with previous behaviour. 
> > But I think that having no GUC here is fine too. If we do not allow 
> > cancelation of unreplicated backends, of course.
> >
> > >>
> > >> And yes, we need additional complexity - but in some other place. 
> > >> Transaction can also be locally committed in presence of a server crash. 
> > >> But this is another difficult problem. Crashed server must not allow data 
> > >> queries until LSN of timeline end is successfully replicated to 
> > >> synchronous_standby_names.
> > >
> > > Hm, that needs to be done anyways. How about doing as proposed
> > > initially upthread [1]? Also, quoting the idea here [2].
> > >
> > > Thoughts?
> > >
> > > [1] 
> > > https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jf2cfayv1mr7fzd8tr4ym5+oweyg1s...@mail.gmail.com
> > > [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> > > in the next txn after the old locally committed txn was canceled. One
> > > way to achieve this is to let the backend, that's making the first
> > > connection, wait for sync standbys to catch up in ClientAuthentication
> > > right after successful authentication. However, I'm not sure this is
> > > the best way to do it at this point.
> >
> >
> > I think ideally the startup process should not allow read-only connections in 
> > CheckRecoveryConsistency() until WAL is replicated to quorum at least 
> > up to the new timeline LSN.
>
> We can't do it in CheckRecoveryConsistency() unless I'm missing
> something, because the walsenders (required for sending the remaining
> WAL to sync standbys to achieve quorum) can only be started after the
> server reaches a consistent state; after all, walsenders are
> specialized backends.


Continuing the above thought (I inadvertently hit the send button
earlier): a simple approach would be to check for quorum in
PostgresMain(), before entering the for (;;) query loop, for
non-walsender backends. The disadvantage is that, in the worst case,
all backends wait there if reaching the sync quorum takes a while
after restart. Roughly, we could do the following in PostgresMain();
of course, we would need a locking mechanism so that every backend
that reaches this point waits for the same LSN:

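if (sync_replication_defined &&
    shmem->wait_for_sync_repl_upon_restart)
{
    /* Wait until the sync standbys have caught up to the current flush LSN. */
    SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false);
    /* Quorum reached: backends arriving later can skip the wait. */
    shmem->wait_for_sync_repl_upon_restart = false;
}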
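
To make the "same LSN" part a bit more concrete, here is a minimal
standalone sketch (plain C11, not PostgreSQL code). Everything in it is
an assumption made for illustration: the RestartSyncState struct, the
restart_sync_* function names, and the use of a compare-and-swap instead
of an LWLock or spinlock. The idea is only that the first backend to
arrive publishes its flush LSN, and every later backend picks up that
published value instead of its own:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)   /* stand-in for InvalidXLogRecPtr */

/* Hypothetical shared-memory state; all names here are illustrative. */
typedef struct RestartSyncState
{
    atomic_bool wait_needed;    /* true until the sync quorum is reached */
    _Atomic Lsn target_lsn;     /* the one LSN all backends wait for */
} RestartSyncState;

/* Called once at startup, before any backend can reach the check. */
static void
restart_sync_init(RestartSyncState *st, bool sync_standbys_defined)
{
    atomic_init(&st->wait_needed, sync_standbys_defined);
    atomic_init(&st->target_lsn, INVALID_LSN);
}

/*
 * Returns true and sets *lsn if the caller still has to wait.
 * my_flush_lsn must not be INVALID_LSN.
 */
static bool
restart_sync_target(RestartSyncState *st, Lsn my_flush_lsn, Lsn *lsn)
{
    Lsn expected = INVALID_LSN;

    if (!atomic_load(&st->wait_needed))
        return false;

    /*
     * The first backend wins the compare-and-swap and publishes its LSN.
     * Later backends lose it, and 'expected' then holds the published LSN.
     */
    if (atomic_compare_exchange_strong(&st->target_lsn, &expected,
                                       my_flush_lsn))
        *lsn = my_flush_lsn;
    else
        *lsn = expected;
    return true;
}

/* Called by a backend once its wait on the target LSN has returned. */
static void
restart_sync_done(RestartSyncState *st)
{
    atomic_store(&st->wait_needed, false);
}
```

A backend would call restart_sync_target() before the query loop, wait
on the returned LSN if it returns true, and then call
restart_sync_done(). Whether an atomic is enough here or a proper lock
is needed is something the real patch would have to settle.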

Thoughts?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
