On Fri, Dec 6, 2019 at 5:30 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
> <j...@dalibo.com> wrote:
> >
> > On Wed, 6 Nov 2019 14:34:38 +0100
> > Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> >
> > > On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:
> > > > My best bet so far is that logicalrep_relmap_invalidate_cb is not
> > > > called after the DDL on the subscriber, so the relmap cache is not
> > > > invalidated. So we end up with slot->tts_tupleDescriptor->natts
> > > > greater than rel->remoterel->natts in slot_store_cstrings, leading
> > > > to the overflow of attrmap and the SIGSEGV.
> > >
> > > It looks like something like that is happening. But it shouldn't.
> > > Different table schemas on publisher and subscriber are well supported,
> > > so this must be an edge case of some kind. Please continue
> > > investigating.
> >
> > I've been able to find the origin of the crash, but it was a long
> > journey.
> >
> > <debugger hard life>
> >
> > I was unable to debug using gdb record because of this famous error:
> >
> >   Process record does not support instruction 0xc5 at address
> >   0x1482758a4b30.
> >
> >   Program stopped.
> >   __memset_avx2_unaligned_erms ()
> >       at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
> >   168     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
> >   such file or directory.
> >
> > Trying with rr, I had constant "stack depth limit exceeded" errors, even
> > with an unlimited stack. Is it worth opening a discussion or a wiki page
> > about these tools? Peter, it looks like you have some experience with
> > rr, any feedback?
> >
> > Finally, Julien Rouhaud spent some time with me after work hours,
> > answering my questions about some parts of the code, and pointed me to
> > the excellent backtrace_functions GUC added a few days ago. This finally
> > did the trick to find out what was happening. Many thanks, Julien!
> >
> > </debugger hard life>
> >
> > Back to the bug itself. Consider a working logical replication setup
> > with constant update/insert activity, e.g. pgbench running against the
> > provider.
> >
> > Now, on the subscriber side, a session issues an "ALTER TABLE ADD
> > COLUMN" on a subscribed table, e.g. pgbench_branches. A cache
> > invalidation message is then pending for this table.
> >
> > In the meantime, the logical replication worker receives an UPDATE to
> > apply. It opens the local relation using "logicalrep_rel_open". It finds
> > the related entry in LogicalRepRelMap is valid, so it does not update
> > its attrmap and directly opens and locks the local relation:
>
> - /* Try to find and lock the relation by name. */
> + /* Try to find the relation by name */
>   relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
>                                         remoterel->relname, -1),
> - lockmode, true);
> + NoLock, true);
>
> I think we can't do this because it could lead to locking the wrong
> reloid. See RangeVarGetRelidExtended. It ensures that after locking
> the relation (which includes accepting invalidation messages), the
> reloid is correct. I think changing the code in the way you are
> suggesting can lead to locking an incorrect reloid.
>
I have made changes to address the above comment; the updated patch is
attached. I could not add a test case for this scenario because it depends
on timing. Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
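For the archives, here is a rough manual reproduction sketch. This is only a
hedged illustration of the scenario described above, not a regression test,
since the crash window depends on timing. It assumes a publisher/subscriber
pair already replicating the pgbench tables; the added column name is made up:

  -- On the publisher: keep constant write activity on the published
  -- tables, e.g. from a shell: pgbench -c 4 -T 600 <publisher_db>

  -- On the subscriber: while the apply worker is busy applying changes,
  -- issue a DDL on a subscribed table so a cache invalidation message
  -- becomes pending for it:
  ALTER TABLE pgbench_branches ADD COLUMN extra_col integer;

  -- Before the fix, the apply worker could then build its tuple slot from
  -- the new, wider local tuple descriptor while still using the stale
  -- (shorter) attrmap, overflowing it and crashing with SIGSEGV.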
From 9bec5e5250c5af034096ed9ab966c2c237518976 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>, vignesh.c <vignesh21@gmail.com>
Date: Fri, 13 Dec 2019 11:59:13 +0530
Subject: [PATCH] Fix subscriber invalid memory access on DDL

Before this patch, the attrmap structure mapping the local fields to
remote ones for a subscribed relation was rebuilt before handling
pending invalidation messages and potential relcache updates. This
could lead to an invalid memory access by overflowing the attrmap
when building the related tuple slot received from the provider.
---
 src/backend/replication/logical/relation.c | 53 ++++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index b386f84..9009ebc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid			relid = InvalidOid;
+	LogicalRepRelation *remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,20 +234,24 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
-	if (!OidIsValid(entry->localreloid))
-	{
-		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
+	remoterel = &entry->remoterel;
 
-		remoterel = &entry->remoterel;
+	/*
+	 * entry->localreloid is set to InvalidOid when the local relation is
+	 * either unknown or invalidated.
+	 * Moreover, when opening and locking a relation, pending invalidation
+	 * messages are processed.
+	 * Because of that, we first need to open the local relation, then
+	 * rebuild the mapping of local attributes if the relation was
+	 * invalidated.
+	 */
 
-		/* Try to find and lock the relation by name. */
+	/*
+	 * Unknown or invalidated local relation. First look for its oid and open
+	 * it. We will build the mapping of local attributes afterwards.
+	 */
+	if (!OidIsValid(entry->localreloid))
+	{
+		/* Try to find and lock the relation by name. */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
 								 lockmode, true);
@@ -256,6 +262,27 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 							remoterel->nspname, remoterel->relname)));
 
 		entry->localrel = table_open(relid, NoLock);
+	}
+	else
+	{
+		relid = entry->localreloid;
+		entry->localrel = table_open(entry->localreloid, lockmode);
+	}
+
+	/*
+	 * Pending invalidation messages might have been processed while grabbing
+	 * the lock during table_open, applying some DDL operations that are then
+	 * reflected in the relcache.
+	 * Because of that, we need to rebuild the mapping of local attributes
+	 * after the relation is opened.
+	 */
+	if (!OidIsValid(entry->localreloid))
+	{
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
 								 remoterel->nspname, remoterel->relname);
@@ -350,8 +377,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
 		entry->localreloid = relid;
 	}
-	else
-		entry->localrel = table_open(entry->localreloid, lockmode);
 
 	if (entry->state != SUBREL_STATE_READY)
 		entry->state = GetSubscriptionRelState(MySubscription->oid,
-- 
1.8.3.1
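With the patch applied, the same scenario can be sanity-checked by hand (a
hedged sketch, not part of the patch): the subscriber's apply worker should
survive the DDL and keep applying changes.

  -- On the subscriber: the apply worker should still be running ...
  SELECT subname, pid, last_msg_receipt_time FROM pg_stat_subscription;

  -- ... and the data should keep converging with the publisher, e.g.
  -- compare aggregates on both nodes once pgbench has finished:
  SELECT count(*), sum(bbalance) FROM pgbench_branches;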