Anthonin Bonnefoy <anthonin.bonne...@datadoghq.com> writes:
> I've updated the patch with another approach:

I got back to looking at this today.  I still don't like very much
about it.  After thinking for awhile, I concur that we want to create
the cmd_context under TopMemoryContext, but I think we can make things
a great deal simpler and less fragile by just keeping it in existence
indefinitely, as attached.  The fundamental problem here is that
exec_replication_command is creating and deleting the context without
regard to where transaction boundaries are.  I don't think the best
way to deal with that is to require it to know where those boundaries
are.

I also don't like the proposed test script.  The test case alone would
be fine, but spinning up an entire new instance seems a rather huge
overhead to carry forevermore for a single test that likely will never
find anything again.  I tried to cram the test case into one of the
existing scripts, but it was hard to find one where it would fit
naturally.  An obvious candidate is subscription/t/100_bugs.pl, but
the test failed there for lack of access to the test_decoding plugin.
Perhaps we could use another plugin, but I didn't try hard.

As for the memory-context-bug-catching changes, I'm not very convinced
by the idea of setting context->type = T_Invalid.  That would help
if someone tries to reference the context while it's still in the
freelist, but not if the reference is from re-use of a stale pointer
after the context has been reassigned (which IIUC is the case here).
I also think that decorating a few MemoryContextSwitchTos with
assertions looks pretty random.  I wonder if we should content
ourselves with adding some assertions that catch attempts to make
a context be its own parent.  It looks like MemoryContextSetParent
already does that, so the only new assert would be in
MemoryContextCreate.

                        regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 987a7336ec8..9fa8beb6103 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1973,8 +1973,10 @@ exec_replication_command(const char *cmd_string)
 	int			parse_rc;
 	Node	   *cmd_node;
 	const char *cmdtag;
-	MemoryContext cmd_context;
-	MemoryContext old_context;
+	MemoryContext old_context = CurrentMemoryContext;
+
+	/* We save and re-use the cmd_context across calls */
+	static MemoryContext cmd_context = NULL;
 
 	/*
 	 * If WAL sender has been told that shutdown is getting close, switch its
@@ -2003,11 +2005,30 @@ exec_replication_command(const char *cmd_string)
 
 	/*
 	 * Prepare to parse and execute the command.
+	 *
+	 * Because replication command execution can involve beginning or ending
+	 * transactions, we need a working context that will survive that, so we
+	 * make it a child of TopMemoryContext.  That in turn creates a hazard of
+	 * long-lived memory leaks if we lose track of the working context.  We
+	 * deal with that by creating it only once per walsender, and resetting it
+	 * for each new command.  (Normally this reset is a no-op, but if the
+	 * prior exec_replication_command call failed with an error, it won't be.)
+	 *
+	 * This is subtler than it looks.  The transactions we manage can extend
+	 * across replication commands, indeed SnapBuildClearExportedSnapshot
+	 * might have just ended one.  Because transaction exit will revert to the
+	 * memory context that was current at transaction start, we need to be
+	 * sure that that context is still valid.  That motivates re-using the
+	 * same cmd_context rather than making a new one each time.
 	 */
-	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
-										"Replication command context",
-										ALLOCSET_DEFAULT_SIZES);
-	old_context = MemoryContextSwitchTo(cmd_context);
+	if (cmd_context == NULL)
+		cmd_context = AllocSetContextCreate(TopMemoryContext,
+											"Replication command context",
+											ALLOCSET_DEFAULT_SIZES);
+	else
+		MemoryContextReset(cmd_context);
+
+	MemoryContextSwitchTo(cmd_context);
 
 	replication_scanner_init(cmd_string, &scanner);
 
@@ -2020,7 +2041,7 @@ exec_replication_command(const char *cmd_string)
 		replication_scanner_finish(scanner);
 
 		MemoryContextSwitchTo(old_context);
-		MemoryContextDelete(cmd_context);
+		MemoryContextReset(cmd_context);
 
 		/* XXX this is a pretty random place to make this check */
 		if (MyDatabaseId == InvalidOid)
@@ -2180,9 +2201,12 @@ exec_replication_command(const char *cmd_string)
 				 cmd_node->type);
 	}
 
-	/* done */
+	/*
+	 * Done.  Revert to caller's memory context, and clean out the cmd_context
+	 * to recover memory right away.
+	 */
 	MemoryContextSwitchTo(old_context);
-	MemoryContextDelete(cmd_context);
+	MemoryContextReset(cmd_context);
 
 	/*
 	 * We need not update ps display or pg_stat_activity, because PostgresMain

Reply via email to