While trying to understand a recent buildfarm failure [1], I got annoyed by the fact that a walsender exposes its last SQL command in pg_stat_activity even when that was a long time ago and what it's been doing since then is replication commands. This leads to *totally* misleading reports, for instance in [1] we read
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:4] LOG: server process (PID 2368853) was terminated by signal 11: Segmentation fault 2020-09-13 19:10:09.632 CEST [5f5e526d.242415:5] DETAIL: Failed process was running: SELECT pg_catalog.set_config('search_path', '', false); even though the process's own log messages paint a much better picture of reality: 2020-09-13 19:10:07.302 CEST [5f5e526f.242555:1] LOG: connection received: host=[local] 2020-09-13 19:10:07.303 CEST [5f5e526f.242555:2] LOG: replication connection authorized: user=bf application_name=sub2 2020-09-13 19:10:07.303 CEST [5f5e526f.242555:3] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 19:10:07.334 CEST [5f5e526f.242555:4] LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 19:10:07.335 CEST [5f5e526f.242555:5] LOG: received replication command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '2', publication_names '"pub2"') 2020-09-13 19:10:07.335 CEST [5f5e526f.242555:6] LOG: starting logical decoding for slot "sub2" 2020-09-13 19:10:07.335 CEST [5f5e526f.242555:7] DETAIL: Streaming transactions committing after 0/159EB38, reading WAL from 0/159EB00. 2020-09-13 19:10:07.335 CEST [5f5e526f.242555:8] LOG: logical decoding found consistent point at 0/159EB00 2020-09-13 19:10:07.335 CEST [5f5e526f.242555:9] DETAIL: There are no running transactions. I think that walsenders ought to report their current replication command as though it were a SQL command. They oughta set debug_query_string, too. While trying to fix this, I also observed that exec_replication_command fails to clean up the temp context it made for parsing the command string, if that turns out to be a SQL command. This very accidentally fails to lead to a long-term memory leak, because that context will be a child of MessageContext so it'll be cleaned out in the next iteration of PostgresMain's main loop. But it's still unacceptably sloppy coding IMHO, and it's closely related to a lot of other randomness in the function; it'd be better to have a separate early-exit path for SQL commands. Attached find a proposed fix for these things. What I'd really like to do is adjust pgstat_report_activity so that callers are *required* to provide some kind of command string when transitioning into STATE_RUNNING state, ie something like Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL); However, before we could do that, we'd have to clean up the rat's nest of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity calls in replication/logical/worker.c. I couldn't make heads or tails of what is going on there, so I did not try. BTW, while I didn't change it here, isn't the second SnapBuildClearExportedSnapshot call in exec_replication_command just useless? We already did that before parsing the command. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 3f756b470a..9a2b154797 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1545,6 +1545,9 @@ exec_replication_command(const char *cmd_string) CHECK_FOR_INTERRUPTS(); + /* + * Parse the command. + */ cmd_context = AllocSetContextCreate(CurrentMemoryContext, "Replication command context", ALLOCSET_DEFAULT_SIZES); @@ -1560,15 +1563,38 @@ exec_replication_command(const char *cmd_string) cmd_node = replication_parse_result; + /* + * If it's a SQL command, just clean up our mess and return false; the + * caller will take care of executing it. + */ + if (IsA(cmd_node, SQLCmd)) + { + if (MyDatabaseId == InvalidOid) + ereport(ERROR, + (errmsg("cannot execute SQL commands in WAL sender for physical replication"))); + + MemoryContextSwitchTo(old_context); + MemoryContextDelete(cmd_context); + + /* Tell the caller that this wasn't a WalSender command. */ + return false; + } + + /* + * Report query to various monitoring facilities. For this purpose, we + * report replication commands just like SQL commands. + */ + debug_query_string = cmd_string; + + pgstat_report_activity(STATE_RUNNING, cmd_string); + /* * Log replication command if log_replication_commands is enabled. Even * when it's disabled, log the command with DEBUG1 level for backward - * compatibility. Note that SQL commands are not logged here, and will be - * logged later if log_statement is enabled. + * compatibility. */ - if (cmd_node->type != T_SQLCmd) - ereport(log_replication_commands ? LOG : DEBUG1, - (errmsg("received replication command: %s", cmd_string))); + ereport(log_replication_commands ? LOG : DEBUG1, + (errmsg("received replication command: %s", cmd_string))); /* * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot. If it was @@ -1578,10 +1604,9 @@ exec_replication_command(const char *cmd_string) SnapBuildClearExportedSnapshot(); /* - * For aborted transactions, don't allow anything except pure SQL, the - * exec_simple_query() will handle it correctly. + * Disallow replication commands in aborted transaction blocks. */ - if (IsAbortedTransactionBlockState() && !IsA(cmd_node, SQLCmd)) + if (IsAbortedTransactionBlockState()) ereport(ERROR, (errcode(ERRCODE_IN_FAILED_SQL_TRANSACTION), errmsg("current transaction is aborted, " @@ -1597,9 +1622,6 @@ exec_replication_command(const char *cmd_string) initStringInfo(&reply_message); initStringInfo(&tmpbuf); - /* Report to pgstat that this process is running */ - pgstat_report_activity(STATE_RUNNING, NULL); - switch (cmd_node->type) { case T_IdentifySystemCmd: @@ -1651,17 +1673,6 @@ exec_replication_command(const char *cmd_string) } break; - case T_SQLCmd: - if (MyDatabaseId == InvalidOid) - ereport(ERROR, - (errmsg("cannot execute SQL commands in WAL sender for physical replication"))); - - /* Report to pgstat that this process is now idle */ - pgstat_report_activity(STATE_IDLE, NULL); - - /* Tell the caller that this wasn't a WalSender command. */ - return false; - default: elog(ERROR, "unrecognized replication command node tag: %u", cmd_node->type); @@ -1677,6 +1688,7 @@ exec_replication_command(const char *cmd_string) /* Report to pgstat that this process is now idle */ pgstat_report_activity(STATE_IDLE, NULL); + debug_query_string = NULL; return true; }