While trying to understand a recent buildfarm failure [1], I got annoyed
by the fact that a walsender exposes its last SQL command in
pg_stat_activity even when that was a long time ago and what it's been
doing since then is replication commands.  This leads to *totally*
misleading reports, for instance in [1] we read

2020-09-13 19:10:09.632 CEST [5f5e526d.242415:4] LOG:  server process (PID 
2368853) was terminated by signal 11: Segmentation fault
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:5] DETAIL:  Failed process was 
running: SELECT pg_catalog.set_config('search_path', '', false);

even though the process's own log messages paint a much better picture of
reality:

2020-09-13 19:10:07.302 CEST [5f5e526f.242555:1] LOG:  connection received: 
host=[local]
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:2] LOG:  replication connection 
authorized: user=bf application_name=sub2
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:3] LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
2020-09-13 19:10:07.334 CEST [5f5e526f.242555:4] LOG:  received replication 
command: IDENTIFY_SYSTEM
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:5] LOG:  received replication 
command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '2', 
publication_names '"pub2"')
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:6] LOG:  starting logical 
decoding for slot "sub2"
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:7] DETAIL:  Streaming 
transactions committing after 0/159EB38, reading WAL from 0/159EB00.
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:8] LOG:  logical decoding found 
consistent point at 0/159EB00
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:9] DETAIL:  There are no running 
transactions.

I think that walsenders ought to report their current replication command
as though it were a SQL command.  They oughta set debug_query_string, too.

While trying to fix this, I also observed that exec_replication_command
fails to clean up the temp context it made for parsing the command string,
if that turns out to be a SQL command.  This very accidentally fails to
lead to a long-term memory leak, because that context will be a child of
MessageContext so it'll be cleaned out in the next iteration of
PostgresMain's main loop.  But it's still unacceptably sloppy coding
IMHO, and it's closely related to a lot of other randomness in the
function; it'd be better to have a separate early-exit path for SQL
commands.

Attached find a proposed fix for these things.

What I'd really like to do is adjust pgstat_report_activity so that
callers are *required* to provide some kind of command string when
transitioning into STATE_RUNNING state, ie something like

        Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);

However, before we could do that, we'd have to clean up the rat's nest
of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity
calls in replication/logical/worker.c.  I couldn't make heads or tails
of what is going on there, so I did not try.

BTW, while I didn't change it here, isn't the second
SnapBuildClearExportedSnapshot call in exec_replication_command just
useless?  We already did that before parsing the command.

                        regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f756b470a..9a2b154797 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1545,6 +1545,9 @@ exec_replication_command(const char *cmd_string)
 
 	CHECK_FOR_INTERRUPTS();
 
+	/*
+	 * Parse the command.
+	 */
 	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
 										"Replication command context",
 										ALLOCSET_DEFAULT_SIZES);
@@ -1560,15 +1563,38 @@ exec_replication_command(const char *cmd_string)
 
 	cmd_node = replication_parse_result;
 
+	/*
+	 * If it's a SQL command, just clean up our mess and return false; the
+	 * caller will take care of executing it.
+	 */
+	if (IsA(cmd_node, SQLCmd))
+	{
+		if (MyDatabaseId == InvalidOid)
+			ereport(ERROR,
+					(errmsg("cannot execute SQL commands in WAL sender for physical replication")));
+
+		MemoryContextSwitchTo(old_context);
+		MemoryContextDelete(cmd_context);
+
+		/* Tell the caller that this wasn't a WalSender command. */
+		return false;
+	}
+
+	/*
+	 * Report query to various monitoring facilities.  For this purpose, we
+	 * report replication commands just like SQL commands.
+	 */
+	debug_query_string = cmd_string;
+
+	pgstat_report_activity(STATE_RUNNING, cmd_string);
+
 	/*
 	 * Log replication command if log_replication_commands is enabled. Even
 	 * when it's disabled, log the command with DEBUG1 level for backward
-	 * compatibility. Note that SQL commands are not logged here, and will be
-	 * logged later if log_statement is enabled.
+	 * compatibility.
 	 */
-	if (cmd_node->type != T_SQLCmd)
-		ereport(log_replication_commands ? LOG : DEBUG1,
-				(errmsg("received replication command: %s", cmd_string)));
+	ereport(log_replication_commands ? LOG : DEBUG1,
+			(errmsg("received replication command: %s", cmd_string)));
 
 	/*
 	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot. If it was
@@ -1578,10 +1604,9 @@ exec_replication_command(const char *cmd_string)
 		SnapBuildClearExportedSnapshot();
 
 	/*
-	 * For aborted transactions, don't allow anything except pure SQL, the
-	 * exec_simple_query() will handle it correctly.
+	 * Disallow replication commands in aborted transaction blocks.
 	 */
-	if (IsAbortedTransactionBlockState() && !IsA(cmd_node, SQLCmd))
+	if (IsAbortedTransactionBlockState())
 		ereport(ERROR,
 				(errcode(ERRCODE_IN_FAILED_SQL_TRANSACTION),
 				 errmsg("current transaction is aborted, "
@@ -1597,9 +1622,6 @@ exec_replication_command(const char *cmd_string)
 	initStringInfo(&reply_message);
 	initStringInfo(&tmpbuf);
 
-	/* Report to pgstat that this process is running */
-	pgstat_report_activity(STATE_RUNNING, NULL);
-
 	switch (cmd_node->type)
 	{
 		case T_IdentifySystemCmd:
@@ -1651,17 +1673,6 @@ exec_replication_command(const char *cmd_string)
 			}
 			break;
 
-		case T_SQLCmd:
-			if (MyDatabaseId == InvalidOid)
-				ereport(ERROR,
-						(errmsg("cannot execute SQL commands in WAL sender for physical replication")));
-
-			/* Report to pgstat that this process is now idle */
-			pgstat_report_activity(STATE_IDLE, NULL);
-
-			/* Tell the caller that this wasn't a WalSender command. */
-			return false;
-
 		default:
 			elog(ERROR, "unrecognized replication command node tag: %u",
 				 cmd_node->type);
@@ -1677,6 +1688,7 @@ exec_replication_command(const char *cmd_string)
 
 	/* Report to pgstat that this process is now idle */
 	pgstat_report_activity(STATE_IDLE, NULL);
+	debug_query_string = NULL;
 
 	return true;
 }

Reply via email to