Hi all,

This is a follow-up of ea792bfd93ab and this thread where I've noticed
that some memory was still leaking when running sysbench with a
logical replication setup:
https://www.postgresql.org/message-id/zz7srcbxxhoyn...@paquier.xyz

Reproducing the problem is quite simple, and can be done with the
following steps (please adapt, like previous thread):
1) Initialize a primary and set up some empty tables:
NUM_TABLES=500
PRIMARY_PORT=5432
STANDBY_PORT=5433
sysbench oltp_read_only --db-driver=pgsql \
                 --pgsql-port=$PRIMARY_PORT \
                 --pgsql-db=postgres \
                 --pgsql-user=postgres \
                 --pgsql-host=/tmp \
                 --tables=$NUM_TABLES --table_size=0 \
                 --report-interval=1 \
                 --threads=1 prepare
2) Create a standby, promote it:
STANDBY_DATA=/define/your/standby/path/
pg_basebackup --checkpoint=fast -D $STANDBY_DATA -p $PRIMARY_PORT -h /tmp/ -R
echo "port = $STANDBY_PORT" >> $STANDBY_DATA/postgresql.conf
pg_ctl -D $STANDBY_DATA start
pg_ctl -D $STANDBY_DATA promote
3) Publication and subscription
psql -p $PRIMARY_PORT -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
psql -p $STANDBY_PORT -c "CREATE SUBSCRIPTION sub1 CONNECTION
'port=$PRIMARY_PORT user=postgres dbname=postgres PUBLICATION pub1 WITH
(enabled, create_slot, slot_name='pub1_to_sub1', copy_data=false);"
4) Run sysbench:
sysbench oltp_write_only --db-driver=pgsql \
                 --pgsql-port=$PRIMARY_PORT \
                 --pgsql-db=postgres \
                 --pgsql-user=postgres \
                 --pgsql-host=/tmp \
                 --tables=$NUM_TABLES --table_size=100 \
                 --report-interval=1 \
                 --threads=$NUM_THREADS run \
                 --time=5000

With that, I've mentioned internally that there was an extra leak and
left the problem lying aside for a few days before being able to come
back to it.  In-between, Jeff Davis has done a review of the code to
note that we leak memory in the CacheMemoryContext, due to the
list_free_deep() done in get_rel_sync_entry() that misses the fact
that the publication name is pstrdup()'d in GetPublication(), but
list_free_deep() does not know that so the memory of the strdupped
publication names stays around.  That's wrong.

Back to today, I've done more benchmarking using the above steps and
logged periodic memory samples of the WAL sender with
pg_log_backend_memory_contexts() (100s, 50 points), and analyzed the 
evolution of the whole thing. "Grand Total" used memory keeps
increasing due to one memory context: CacheMemoryContext.  So yes,
Jeff has spotted what looks like the only leak we have. 

Attached is a graph that shows the evolution of memory between HEAD
and a patched version for the used memory of CacheMemoryContext
(ylabel is in bytes, could not get that right as my gnuplot skills are
bad).

One thing to note in this case is that I've made the leak more
noticeable with this tweak to force the publication to be reset each
time with go through get_rel_sync_entry(), or memory takes much longer
to pile up.  That does not change the problem, speeds up the detection
by a large factor:
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -2061,12 +2061,13 @@ get_rel_sync_entry(PGOutputData *data, Relation 
relation)
         List       *rel_publications = NIL;
 
         /* Reload publications if needed before use. */
-        if (!publications_valid)
+        elog(WARNING, "forcing reload publications");
         {
             oldctx = MemoryContextSwitchTo(CacheMemoryContext);
             if (data->publications)
             {
                 list_free_deep(data->publications);

This problem exists since the introduction of logical replication,
down to v10.  It would be sad to have an extra loop on publications to
free explicitely the publication names, and the issue would still be
hidden to the existing callers of GetPublication(), so I'd suggest to
change the Publication structure to use a NAMEDATALEN string to force
the free to happen, as in the attached.

That means an ABI break.  I've looked around and did not notice any
out-of-core code relying on sizeof(Publication).  That does not mean
that nothing would break, of course, but the odds should be pretty
good as this is a rather internal structure.  One way would be to just
fix that on HEAD and call it a day, but I'm aware of heavy logirep
users whose workloads would be able to see memory bloating because
they maintain WAL senders around.

Thoughts or comments?
--
Michael
0,3968872,4014248
100,4011832,4014248
200,3997928,4001048
300,4022224,4001048
400,4040608,3984520
500,4042552,3999448
600,4053576,3999448
700,4067464,3938632
800,4068992,3998120
900,4080632,3998120
1000,4090056,3890312
1100,4083968,3998120
1200,4106728,3998120
1300,4119200,3949688
1400,4104680,3998120
1500,4134536,3998120
1600,4144408,3905128
1700,4137608,3998120
1800,4159744,3998120
1900,4171488,3952232
2000,4170472,3998120
2100,4186584,3998120
2200,4196520,3946040
2300,4192296,3998120
2400,4211712,3998120
2500,4221792,3916296
2600,4214752,3998120
2700,4237344,3998120
2800,4245656,3916184
2900,4244736,3998120
3000,4260920,3998120
3100,4268488,3937304
3200,4281000,3998120
3300,4285752,3998120
3400,4296144,3903688
3500,4294672,3998120
3600,4310760,3996904
3700,4320624,3945928
3800,4313824,3996904
3900,4335768,3998120
4000,4345776,3940952
4100,4355952,3998120
4200,4360992,3998120
4300,4370616,3898712
4400,4360504,3998120
4500,4385616,3998120
4600,4394168,3952120
4700,4394104,3998120
4800,4408280,3998120
4900,4418976,3931000
diff --git a/src/include/catalog/pg_publication.h b/src/include/catalog/pg_publication.h
index 9a83a72d6b..809bd1f0a6 100644
--- a/src/include/catalog/pg_publication.h
+++ b/src/include/catalog/pg_publication.h
@@ -103,7 +103,7 @@ typedef struct PublicationDesc
 typedef struct Publication
 {
 	Oid			oid;
-	char	   *name;
+	char		name[NAMEDATALEN];
 	bool		alltables;
 	bool		pubviaroot;
 	bool		pubgencols;
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 9bbb60463f..8f4e3b83f1 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -1061,7 +1061,7 @@ GetPublication(Oid pubid)
 
 	pub = (Publication *) palloc(sizeof(Publication));
 	pub->oid = pubid;
-	pub->name = pstrdup(NameStr(pubform->pubname));
+	strlcpy(pub->name, NameStr(pubform->pubname), NAMEDATALEN);
 	pub->alltables = pubform->puballtables;
 	pub->pubactions.pubinsert = pubform->pubinsert;
 	pub->pubactions.pubupdate = pubform->pubupdate;

Attachment: signature.asc
Description: PGP signature

Reply via email to