Hi all, This is a follow-up of ea792bfd93ab and this thread where I've noticed that some memory was still leaking when running sysbench with a logical replication setup: https://www.postgresql.org/message-id/zz7srcbxxhoyn...@paquier.xyz
Reproducing the problem is quite simple, and can be done with the following steps (please adapt, like previous thread): 1) Initialize a primary and set up some empty tables: NUM_TABLES=500 PRIMARY_PORT=5432 STANDBY_PORT=5433 sysbench oltp_read_only --db-driver=pgsql \ --pgsql-port=$PRIMARY_PORT \ --pgsql-db=postgres \ --pgsql-user=postgres \ --pgsql-host=/tmp \ --tables=$NUM_TABLES --table_size=0 \ --report-interval=1 \ --threads=1 prepare 2) Create a standby, promote it: STANDBY_DATA=/define/your/standby/path/ pg_basebackup --checkpoint=fast -D $STANDBY_DATA -p $PRIMARY_PORT -h /tmp/ -R echo "port = $STANDBY_PORT" >> $STANDBY_DATA/postgresql.conf pg_ctl -D $STANDBY_DATA start pg_ctl -D $STANDBY_DATA promote 3) Publication and subscription psql -p $PRIMARY_PORT -c "CREATE PUBLICATION pub1 FOR ALL TABLES;" psql -p $STANDBY_PORT -c "CREATE SUBSCRIPTION sub1 CONNECTION 'port=$PRIMARY_PORT user=postgres dbname=postgres PUBLICATION pub1 WITH (enabled, create_slot, slot_name='pub1_to_sub1', copy_data=false);" 4) Run sysbench: sysbench oltp_write_only --db-driver=pgsql \ --pgsql-port=$PRIMARY_PORT \ --pgsql-db=postgres \ --pgsql-user=postgres \ --pgsql-host=/tmp \ --tables=$NUM_TABLES --table_size=100 \ --report-interval=1 \ --threads=$NUM_THREADS run \ --time=5000 With that, I've mentioned internally that there was an extra leak and left the problem lying aside for a few days before being able to come back to it. In-between, Jeff Davis has done a review of the code to note that we leak memory in the CacheMemoryContext, due to the list_free_deep() done in get_rel_sync_entry() that misses the fact that the publication name is pstrdup()'d in GetPublication(), but list_free_deep() does not know that so the memory of the strdupped publication names stays around. That's wrong. Back to today, I've done more benchmarking using the above steps and logged periodic memory samples of the WAL sender with pg_log_backend_memory_contexts() (100s, 50 points), and analyzed the evolution of the whole thing. "Grand Total" used memory keeps increasing due to one memory context: CacheMemoryContext. So yes, Jeff has spotted what looks like the only leak we have. Attached is a graph that shows the evolution of memory between HEAD and a patched version for the used memory of CacheMemoryContext (ylabel is in bytes, could not get that right as my gnuplot skills are bad). One thing to note in this case is that I've made the leak more noticeable with this tweak to force the publication to be reset each time with go through get_rel_sync_entry(), or memory takes much longer to pile up. That does not change the problem, speeds up the detection by a large factor: --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -2061,12 +2061,13 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) List *rel_publications = NIL; /* Reload publications if needed before use. */ - if (!publications_valid) + elog(WARNING, "forcing reload publications"); { oldctx = MemoryContextSwitchTo(CacheMemoryContext); if (data->publications) { list_free_deep(data->publications); This problem exists since the introduction of logical replication, down to v10. It would be sad to have an extra loop on publications to free explicitely the publication names, and the issue would still be hidden to the existing callers of GetPublication(), so I'd suggest to change the Publication structure to use a NAMEDATALEN string to force the free to happen, as in the attached. That means an ABI break. I've looked around and did not notice any out-of-core code relying on sizeof(Publication). That does not mean that nothing would break, of course, but the odds should be pretty good as this is a rather internal structure. One way would be to just fix that on HEAD and call it a day, but I'm aware of heavy logirep users whose workloads would be able to see memory bloating because they maintain WAL senders around. Thoughts or comments? -- Michael
0,3968872,4014248 100,4011832,4014248 200,3997928,4001048 300,4022224,4001048 400,4040608,3984520 500,4042552,3999448 600,4053576,3999448 700,4067464,3938632 800,4068992,3998120 900,4080632,3998120 1000,4090056,3890312 1100,4083968,3998120 1200,4106728,3998120 1300,4119200,3949688 1400,4104680,3998120 1500,4134536,3998120 1600,4144408,3905128 1700,4137608,3998120 1800,4159744,3998120 1900,4171488,3952232 2000,4170472,3998120 2100,4186584,3998120 2200,4196520,3946040 2300,4192296,3998120 2400,4211712,3998120 2500,4221792,3916296 2600,4214752,3998120 2700,4237344,3998120 2800,4245656,3916184 2900,4244736,3998120 3000,4260920,3998120 3100,4268488,3937304 3200,4281000,3998120 3300,4285752,3998120 3400,4296144,3903688 3500,4294672,3998120 3600,4310760,3996904 3700,4320624,3945928 3800,4313824,3996904 3900,4335768,3998120 4000,4345776,3940952 4100,4355952,3998120 4200,4360992,3998120 4300,4370616,3898712 4400,4360504,3998120 4500,4385616,3998120 4600,4394168,3952120 4700,4394104,3998120 4800,4408280,3998120 4900,4418976,3931000
diff --git a/src/include/catalog/pg_publication.h b/src/include/catalog/pg_publication.h index 9a83a72d6b..809bd1f0a6 100644 --- a/src/include/catalog/pg_publication.h +++ b/src/include/catalog/pg_publication.h @@ -103,7 +103,7 @@ typedef struct PublicationDesc typedef struct Publication { Oid oid; - char *name; + char name[NAMEDATALEN]; bool alltables; bool pubviaroot; bool pubgencols; diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c index 9bbb60463f..8f4e3b83f1 100644 --- a/src/backend/catalog/pg_publication.c +++ b/src/backend/catalog/pg_publication.c @@ -1061,7 +1061,7 @@ GetPublication(Oid pubid) pub = (Publication *) palloc(sizeof(Publication)); pub->oid = pubid; - pub->name = pstrdup(NameStr(pubform->pubname)); + strlcpy(pub->name, NameStr(pubform->pubname), NAMEDATALEN); pub->alltables = pubform->puballtables; pub->pubactions.pubinsert = pubform->pubinsert; pub->pubactions.pubupdate = pubform->pubupdate;
signature.asc
Description: PGP signature