Hi Brad,

We did some research on this at EDB; here is a summary of the findings and suggested next steps.
*Context:*

*Given* a 3-node cluster M0, M1 and M2, *and* M0 is the primary with M1 and M2 as the replicas - we will call this *"T0" (Timeline 0)*, *and* a *"switchover"* command is issued to terminate M0 and promote M1 as the new primary.

*Expected behavior:*

*When* M0 is terminated and M1 is promoted to primary, *then* a *".history"* file is written to a *shared archive repo (S3)* to indicate that *T0* has ended and *T1* has been created with M1 as the new primary, *and* the *"restore_command"* is executed on the replica M2 to replay any remaining WAL files from T0 and then switch to streaming from *M1* as the new primary henceforth.

*IBM's problem:*

*Given* that the switchover process runs asynchronously, *when* the *"restore_command"* is run on *M2* before *M1* creates the *".history"* file, *then* M2 interprets this as: *no new timeline has been created; I'll wait for the next WAL file from T0.* *But* the next WAL file from T0 will *NEVER* come, because M0 has been terminated, and M2 hangs there looking for something that will never exist.

*IBM's question:*

Why does the *"restore_command"* check for the .history file only once and move on, instead of polling for the .history file for a specified period, so that it "waits" for the file to show up and everything executes in the correct order?

*EDB's current diagnosis:*

The *"restore_command"* on the M2 replica is behaving as expected. Based on a test case we've set up, we can see that the *"restore_command"* is checking for both the *".history"* file and the next *WAL* file from the current timeline.
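To make the question concrete: what IBM is asking about would look roughly like the sketch below - a wrapper script used as the restore_command that polls the archive for a bounded period before giving up. This is only a hypothetical illustration (the script name, the retry/sleep values, and the env-var overrides are our assumptions; the default archive path is the one from our test logs); it is not a recommendation, since the standby already re-invokes the restore_command in a loop on its own.

```shell
#!/bin/sh
# restore_retry.sh - hypothetical polling restore_command wrapper (a sketch).
# PostgreSQL would invoke it via:
#   restore_command = '/usr/local/bin/restore_retry.sh %f %p'
restore_retry() {
    file=$1    # %f - file name the server asked for
    dest=$2    # %p - path the server wants it copied to
    archive_dir=${ARCHIVE_DIR:-/var/lib/pgsql/12/backups/archives}
    retries=${RETRIES:-5}
    i=0
    while [ "$i" -lt "$retries" ]; do
        # Succeed as soon as the file can be copied out of the archive.
        if cp "$archive_dir/$file" "$dest" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1  # non-zero tells the server the file is (still) not there
}
```

Note the trade-off: polling inside the restore_command stalls each individual lookup, whereas the server's own retry loop alternates between the .history probe and the next WAL segment - which is exactly the behavior our test case below shows.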
*Based on this, we suspect the issue lies with IBM's current configuration settings, and we recommend that IBM try to recreate the issue manually, without Patroni, to confirm that their current assumptions are correct.*

*Here's the test case we ran, FYI:*

1) Set up a 3-node cluster with m-0 (primary), m-1 (replica) and m-2 (replica).

2) After m-0 is stopped and m-1 is not yet promoted, the m-2 logs contain the following, as expected:

*FATAL: could not connect to the primary server: could not connect to server: No such file or directory*
*Is the server running locally and accepting connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?*
*cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000002.history’: No such file or directory*
*cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000001000000010000004E’: No such file or directory*

This confirms that while streaming replication is failing, m-2 keeps looking for both the 00000001000000010000004E WAL file and the history file (for the new timeline).

3) After m-1 gets promoted, we see the following files appear in the archive repo:

*00000001000000010000004E.partial # WAL from timeline 1*
*00000002.history # history file for timeline 2*

4) As soon as those files come up in the repo, m-2 sees them and replays them:

*LOG: restored log file "00000002.history" from archive # as you can see, it reads the history file as soon as it appears*
*cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000003.history’: No such file or directory*
*LOG: restored log file "00000002.history" from archive*
*LOG: new target timeline is 2 # switches to timeline 2 after reading the .history file*
*cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000002000000010000004E’: No such file or directory*
*# this is happening because primary_conninfo needs to be changed to point to m-1, and that change requires a db restart to get picked up*
*cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000001000000010000004E’: No such file or directory*

On Wed, Nov 18, 2020 at 8:39 AM Brad Nicholson <br...@ca.ibm.com> wrote:

> Dilip Kumar <dilipbal...@gmail.com> wrote on 2020/11/15 04:47:12 AM:
> > I am not sure how Patroni does it internally, can you explain the
> > scenario in more detail? Suppose you are executing the promote on m-1,
> > and if the promotion is successful it will switch the timeline and
> > create the timeline history file. Now, once the promotion is
> > successful, if we change the primary_conninfo on m-2 then it will
> > restart the walsender and look for the latest .history file, which it
> > should find either from direct streaming or through the
> > restore_command. If you are saying that m-2 tried to look for the
> > history file before m-1 created it, then it seems like you changed the
> > primary_conninfo on m-2 before the m-1 promotion completed.
>
> Patroni first issues a fast shutdown to m-0. Once that completes, it
> issues a promote on m-1.
>
> In this case, m-2 is not connecting directly to m-0 or m-1 via a
> hostname; it is connecting to a virtual layer that routes the
> connectivity to the current primary. This does not change.
>
> Brad.
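As a footnote, the archive file names that appear in the logs above are derived mechanically from the timeline and segment numbers, which is why the standby knows exactly which files to probe for. A minimal sketch of the naming scheme (the helper function names are ours, not PostgreSQL's):

```shell
#!/bin/sh
# Sketch of PostgreSQL's archive file naming (helper names are ours).

# Timeline history file: 8 hex digits of the timeline ID + ".history".
history_filename() {
    printf '%08X.history' "$1"
}

# WAL segment: timeline ID, log ID and segment number, 8 hex digits each.
wal_filename() {
    printf '%08X%08X%08X' "$1" "$2" "$3"
}
```

For example, `history_filename 2` yields 00000002.history and `wal_filename 1 1 78` yields 00000001000000010000004E - the two files m-2 was probing for in step 2 of the test case above.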