Re: allow online change primary_conninfo

Sergei Kornilov Wed, 28 Aug 2019 02:51:03 -0700

Hello

Updated patch attached (I also merged it into a single file).


> + <para>
> + WAL receiver will be restarted after <varname>primary_slot_name</varname>
> + was changed.
>           </para>
> The sentence sounds strange. Here is a suggestion:
> The WAL receiver is restarted after an update of primary_slot_name (or
> primary_conninfo).

Changed.

> The comment at the top of the call of ProcessStartupSigHup() in
> HandleStartupProcInterrupts() needs to be updated as it mentions a
> configuration file re-read, but that's not the case anymore.

Changed it to "Process any requests or signals received recently.", as in other
places (e.g. syslogger).

> pendingRestartSource's name does not represent what it does, as it is
> only associated with the restart of a WAL receiver when in streaming
> state, and that's a no-op for the archive mode and the local mode.

I renamed it to pendingWalRcvRestart and replaced the switch with a simple condition.
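To illustrate the simplified condition, here is a minimal standalone model of how the flag is consumed in WaitForWALToBecomeAvailable(). It is a sketch, not the patch code: WalSource only mirrors the XLOG_FROM_* codes, StartupState and handle_source_step() are hypothetical names, and the real ShutdownWalRcv()/RequestXLogStreaming() calls are replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the XLOG_FROM_* source codes in xlog.c. */
typedef enum
{
	SRC_ANY,
	SRC_ARCHIVE,
	SRC_PG_WAL,
	SRC_STREAM
} WalSource;

/* Hypothetical subset of the startup process state touched by the patch. */
typedef struct
{
	WalSource	currentSource;
	bool		lastSourceFailed;		/* we just switched sources */
	bool		pendingWalRcvRestart;	/* set on SIGHUP if settings changed */
	int			walrcvRequests;			/* counts (re)launch requests */
} StartupState;

/*
 * One pass through the "handle failure / configuration change" step.
 * Only the streaming source ever (re)launches a walreceiver; for the
 * archive and pg_wal sources a pending restart is simply discarded.
 */
static void
handle_source_step(StartupState *st)
{
	assert(st != NULL);

	if (st->currentSource == SRC_STREAM &&
		(st->lastSourceFailed || st->pendingWalRcvRestart))
	{
		/* Patch: ShutdownWalRcv() if running, then RequestXLogStreaming(). */
		st->walrcvRequests++;
	}

	/* Both flags are consumed before reading from the chosen source. */
	st->lastSourceFailed = false;
	st->pendingWalRcvRestart = false;
}
```

With currentSource = SRC_STREAM and pendingWalRcvRestart set, one step issues exactly one request and clears the flag, so a second step issues none; with SRC_ARCHIVE the same flag leads to no request at all, matching the point that a restart is a no-op outside streaming.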

> So when shutting down the WAL receiver after a parameter change, what
> happens is that the startup process waits for retrieve_retry_interval
> before moving back to the archive mode. Only after scanning again the
> archives do we restart a new WAL receiver.

As I answered earlier, there is no switch to archive mode and no
wal_retrieve_retry_interval wait in my patch. I rechecked on the current revision
too:

2019-08-28 12:16:27.295 MSK 11180 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/30346C8 flush 0/30346C8 apply 0/30346C8
2019-08-28 12:16:27.493 MSK 11172 @ from  [vxid: txid:0] [] LOG:  received SIGHUP, reloading configuration files
2019-08-28 12:16:27.494 MSK 11172 @ from  [vxid: txid:0] [] LOG:  parameter "primary_conninfo" changed to "host='/tmp' port=5555 sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any"
2019-08-28 12:16:27.496 MSK 11173 @ from  [vxid:1/0 txid:0] [] LOG:  The WAL receiver is going to be restarted due to change of primary_conninfo
2019-08-28 12:16:27.496 MSK 11176 @ from  [vxid: txid:0] [] DEBUG:  checkpointer updated shared memory configuration values
2019-08-28 12:16:27.496 MSK 11180 @ from  [vxid: txid:0] [] FATAL:  terminating walreceiver process due to administrator command
2019-08-28 12:16:27.500 MSK 11335 @ from  [vxid: txid:0] [] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2019-08-28 12:16:27.500 MSK 11335 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-28 12:16:27.50037+03 receipttime 2019-08-28 12:16:27.500821+03 replication apply delay 0 ms transfer latency 0 ms

There are no "DEBUG:  switched WAL source from stream to archive after failure"
messages and no wal_retrieve_retry_interval delay between the walreceiver
termination and the new streaming start (wal_retrieve_retry_interval = 5s).
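The log sequence above (reload, restart notice, walreceiver termination, new streaming start a few milliseconds later) follows from how ProcessStartupSigHup() decides whether to plan a restart. Below is a minimal standalone sketch of that decision under stated assumptions: malloc/free stand in for pstrdup/pfree, and the hypothetical simulate_reload() stands in for ProcessConfigFile(PGC_SIGHUP). In this sketch the reload frees and replaces the old strings, mirroring how a GUC string value is replaced, which is presumably why the patch copies the old values before reloading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the two GUC string variables. */
static char *PrimaryConnInfo = NULL;
static char *PrimarySlotName = NULL;

/* Portable strdup replacement (plays the role of pstrdup()). */
static char *
copy_string(const char *s)
{
	size_t		len = strlen(s) + 1;
	char	   *p = malloc(len);

	assert(p != NULL);
	memcpy(p, s, len);
	return p;
}

/*
 * Simulated reload: the old strings are freed and replaced, so any pointer
 * to an old value would dangle after this call.
 */
static void
simulate_reload(const char *conninfo, const char *slotname)
{
	free(PrimaryConnInfo);
	free(PrimarySlotName);
	PrimaryConnInfo = copy_string(conninfo);
	PrimarySlotName = copy_string(slotname);
}

/*
 * Sketch of the SIGHUP handling: snapshot the old values, reload, compare,
 * and return true if a walreceiver restart should be planned.  A restart is
 * planned only while streaming with a running walreceiver; otherwise the new
 * values are just used the next time a walreceiver is requested.
 */
static bool
startup_sighup(const char *newConninfo, const char *newSlotname,
			   bool streaming, bool walrcvRunning)
{
	char	   *conninfo;
	char	   *slotname;
	bool		changed;

	/* The sketch requires simulate_reload() to have set initial values. */
	assert(PrimaryConnInfo != NULL && PrimarySlotName != NULL);
	conninfo = copy_string(PrimaryConnInfo);
	slotname = copy_string(PrimarySlotName);

	simulate_reload(newConninfo, newSlotname);

	changed = strcmp(conninfo, PrimaryConnInfo) != 0 ||
		strcmp(slotname, PrimarySlotName) != 0;

	free(conninfo);
	free(slotname);
	return changed && streaming && walrcvRunning;
}
```

Calling simulate_reload() once to set initial values and then startup_sighup() with a different conninfo while streaming returns true; identical values, or a change while not streaming, return false.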

regards, Sergei

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..b36749fe91 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3929,9 +3929,13 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
           <varname>primary_conninfo</varname> string.
          </para>
          <para>
-          This parameter can only be set at server start.
+          This parameter can only be set in the <filename>postgresql.conf</filename>
+          file or on the server command line.
           This setting has no effect if the server is not in standby mode.
          </para>
+         <para>
+          The WAL receiver is restarted after an update of <varname>primary_conninfo</varname>.
+         </para>
         </listitem>
        </varlistentry>
        <varlistentry id="guc-primary-slot-name" xreflabel="primary_slot_name">
@@ -3946,9 +3950,13 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
           connecting to the sending server via streaming replication to control
           resource removal on the upstream node
           (see <xref linkend="streaming-replication-slots"/>).
-          This parameter can only be set at server start.
+          This parameter can only be set in the <filename>postgresql.conf</filename>
+          file or on the server command line.
           This setting has no effect if <varname>primary_conninfo</varname> is not
-          set.
+          set or the server is not in standby mode.
+         </para>
+         <para>
+          The WAL receiver is restarted after an update of <varname>primary_slot_name</varname>.
          </para>
         </listitem>
        </varlistentry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e651a841bb..4eed462f34 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -803,6 +803,12 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
 
+/*
+ * Set when a running walreceiver needs to be restarted due to a configuration
+ * change. Only meaningful for the XLOG_FROM_STREAM source.
+ */
+static bool pendingWalRcvRestart = false;
+
 typedef struct XLogPageReadPrivate
 {
 	int			emode;
@@ -11787,48 +11793,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (!StandbyMode)
 						return false;
 
-					/*
-					 * If primary_conninfo is set, launch walreceiver to try
-					 * to stream the missing WAL.
-					 *
-					 * If fetching_ckpt is true, RecPtr points to the initial
-					 * checkpoint location. In that case, we use RedoStartLSN
-					 * as the streaming start position instead of RecPtr, so
-					 * that when we later jump backwards to start redo at
-					 * RedoStartLSN, we will have the logs streamed already.
-					 */
-					if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
-					{
-						XLogRecPtr	ptr;
-						TimeLineID	tli;
-
-						if (fetching_ckpt)
-						{
-							ptr = RedoStartLSN;
-							tli = ControlFile->checkPointCopy.ThisTimeLineID;
-						}
-						else
-						{
-							ptr = RecPtr;
-
-							/*
-							 * Use the record begin position to determine the
-							 * TLI, rather than the position we're reading.
-							 */
-							tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
-							if (curFileTLI > 0 && tli < curFileTLI)
-								elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
-									 (uint32) (tliRecPtr >> 32),
-									 (uint32) tliRecPtr,
-									 tli, curFileTLI);
-						}
-						curFileTLI = tli;
-						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-											 PrimarySlotName);
-						receivedUpto = 0;
-					}
-
 					/*
 					 * Move to XLOG_FROM_STREAM state in either case. We'll
 					 * get immediate failure if we didn't launch walreceiver,
@@ -11928,10 +11892,66 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 lastSourceFailed ? "failure" : "success");
 
 		/*
-		 * We've now handled possible failure. Try to read from the chosen
-		 * source.
+		 * Request walreceiver to start if we switch from another source or if
+		 * we need to change walreceiver connection configuration.
+		 */
+		if (currentSource == XLOG_FROM_STREAM && (lastSourceFailed || pendingWalRcvRestart))
+		{
+			/*
+			 * Ensure walreceiver is not running
+			 */
+			if (WalRcvRunning())
+				ShutdownWalRcv();
+
+			/*
+			 * If primary_conninfo is set, launch walreceiver to try to stream
+			 * the missing WAL.
+			 *
+			 * If fetching_ckpt is true, RecPtr points to the initial
+			 * checkpoint location. In that case, we use RedoStartLSN as the
+			 * streaming start position instead of RecPtr, so that when we
+			 * later jump backwards to start redo at RedoStartLSN, we will
+			 * have the logs streamed already.
+			 */
+			if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
+			{
+				XLogRecPtr	ptr;
+				TimeLineID	tli;
+
+				if (fetching_ckpt)
+				{
+					ptr = RedoStartLSN;
+					tli = ControlFile->checkPointCopy.ThisTimeLineID;
+				}
+				else
+				{
+					ptr = RecPtr;
+
+					/*
+					 * Use the record begin position to determine the TLI,
+					 * rather than the position we're reading.
+					 */
+					tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
+
+					if (curFileTLI > 0 && tli < curFileTLI)
+						elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
+							 (uint32) (tliRecPtr >> 32),
+							 (uint32) tliRecPtr,
+							 tli, curFileTLI);
+				}
+				curFileTLI = tli;
+				RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
+									 PrimarySlotName);
+				receivedUpto = 0;
+			}
+		}
+
+		/*
+		 * We've now handled possible failure and configuration change. Try to
+		 * read from the chosen source.
 		 */
 		lastSourceFailed = false;
+		pendingWalRcvRestart = false;
 
 		switch (currentSource)
 		{
@@ -12096,6 +12116,45 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	return false;				/* not reached */
 }
 
+/*
+ * Re-read the config file and plan a restart of the running walreceiver if
+ * the connection settings were changed.
+ */
+void
+ProcessStartupSigHup(void)
+{
+	char	   *conninfo = pstrdup(PrimaryConnInfo);
+	char	   *slotname = pstrdup(PrimarySlotName);
+	bool		conninfoChanged;
+	bool		slotnameChanged;
+
+	ProcessConfigFile(PGC_SIGHUP);
+
+	/*
+	 * We need to restart the walreceiver if replication settings were changed.
+	 */
+	conninfoChanged = (strcmp(conninfo, PrimaryConnInfo) != 0);
+	slotnameChanged = (strcmp(slotname, PrimarySlotName) != 0);
+
+	if ((conninfoChanged || slotnameChanged) &&
+		currentSource == XLOG_FROM_STREAM &&
+		WalRcvRunning())
+	{
+		if (conninfoChanged && slotnameChanged)
+			ereport(LOG,
+					(errmsg("The WAL receiver is going to be restarted due to change of primary_conninfo and primary_slot_name")));
+		else
+			ereport(LOG,
+					(errmsg("The WAL receiver is going to be restarted due to change of %s",
+							conninfoChanged ? "primary_conninfo" : "primary_slot_name")));
+
+		pendingWalRcvRestart = true;
+	}
+
+	pfree(conninfo);
+	pfree(slotname);
+}
+
 /*
  * Determine what log level should be used to report a corrupt WAL record
  * in the current WAL page, previously read by XLogPageRead().
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 5048a2c2aa..62492daf6b 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -142,12 +142,12 @@ void
 HandleStartupProcInterrupts(void)
 {
 	/*
-	 * Check if we were requested to re-read config file.
+	 * Process any requests or signals received recently.
 	 */
 	if (got_SIGHUP)
 	{
 		got_SIGHUP = false;
-		ProcessConfigFile(PGC_SIGHUP);
+		ProcessStartupSigHup();
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..5ffe593517 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3558,7 +3558,7 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"primary_conninfo", PGC_POSTMASTER, REPLICATION_STANDBY,
+		{"primary_conninfo", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the connection string to be used to connect to the sending server."),
 			NULL,
 			GUC_SUPERUSER_ONLY
@@ -3569,7 +3569,7 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"primary_slot_name", PGC_POSTMASTER, REPLICATION_STANDBY,
+		{"primary_slot_name", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the name of the replication slot to use on the sending server."),
 			NULL
 		},
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..9e49020b19 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -320,6 +320,7 @@ extern void SetWalWriterSleeping(bool sleeping);
 
 extern void XLogRequestWalReceiverReply(void);
 
+extern void ProcessStartupSigHup(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 3c743d7d7c..ae80f4df3a 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 32;
+use Test::More tests => 33;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -208,7 +208,9 @@ $node_standby_2->append_conf('postgresql.conf',
 	"primary_slot_name = $slotname_2");
 $node_standby_2->append_conf('postgresql.conf',
 	"wal_receiver_status_interval = 1");
-$node_standby_2->restart;
+# primary_slot_name can be changed without a restart; get_slot_xmins below
+# will wait for the change to take effect
+$node_standby_2->reload;
 
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for
 # given boolean condition to be true to ensure we've reached a quiescent state
@@ -344,3 +346,21 @@ is($catalog_xmin, '',
 is($xmin, '', 'xmin of cascaded slot null with hs feedback reset');
 is($catalog_xmin, '',
 	'catalog xmin of cascaded slot still null with hs_feedback reset');
+
+note "check that primary_conninfo can be changed without restart";
+$node_standby_2->append_conf('postgresql.conf',
+	"primary_slot_name = ''");
+$node_standby_2->enable_streaming($node_master);
+$node_standby_2->reload;
+
+# stop standby_1 to be sure standby_2 is not still streaming from the cascade
+$node_standby_1->stop;
+
+my $newval = $node_master->safe_psql('postgres',
+'INSERT INTO replayed(val) SELECT coalesce(max(val),0) + 1 AS newval FROM replayed RETURNING val'
+);
+$node_master->wait_for_catchup($node_standby_2, 'replay',
+	$node_master->lsn('insert'));
+my $is_replayed = $node_standby_2->safe_psql('postgres',
+	qq[SELECT 1 FROM replayed WHERE val = $newval]);
+is($is_replayed, qq(1), "standby_2 didn't replay master value $newval");
