On Wed, 13 Jan 2021 at 16:49, Bharath Rupireddy wrote: > On Wed, Jan 13, 2021 at 11:27 AM Amit Kapila <amit.kapil...@gmail.com> wrote: >> >> On Wed, Jan 13, 2021 at 11:08 AM Bharath Rupireddy >> <bharath.rupireddyforpostg...@gmail.com> wrote: >> > >> > On Wed, Jan 13, 2021 at 10:33 AM Amit Kapila <amit.kapil...@gmail.com> >> > wrote: >> > > >> > > On Tue, Jan 12, 2021 at 5:23 PM japin <japi...@hotmail.com> wrote: >> > > > >> > > > On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote: >> > > > > On Tue, Jan 12, 2021 at 4:47 PM Li Japin <japi...@hotmail.com> wrote: >> > > > >> IIUC the logical replication only replicate the tables in >> > > > >> publication, I think >> > > > >> when the tables that aren't in publication should not be replicated. >> > > > >> >> > > > >> Attached the patch that fixes it. Thought? >> > > > > >> > > > > With that change, we don't get the behaviour that's stated in the >> > > > > document - "The ADD TABLE and DROP TABLE clauses will add and remove >> > > > > one or more tables from the publication. Note that adding tables to a >> > > > > publication that is already subscribed to will require a ALTER >> > > > > SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side >> > > > > in >> > > > > order to become effective" - >> > > > > https://www.postgresql.org/docs/devel/sql-alterpublication.html. >> > > > > >> > > > >> > > > The documentation only emphasize adding tables to a publication, not >> > > > include dropping tables from a publication. >> > > > >> > > >> > > Right and I think that is ensured by the subscriber by calling >> > > should_apply_changes_for_rel() which won't return true unless the >> > > newly added relation is not synced via Refresh Publication. So, this >> > > means with or without this patch we should be sending the changes of >> > > the newly published table from the publisher? >> > >> > Oh, my bad, alter subscription...refresh publication is required only when >> > the tables are added to the publisher. Patch by japin makes the walsender >> > process to stop sending the data to the subscriber/apply worker. The patch >> > is based on the idea of looking at the PUBLICATIONRELMAP in >> > get_rel_sync_entry when the entries have been invalidated in >> > rel_sync_cache_publication_cb because of alter publication...drop table. >> > , >> > When the alter subscription...refresh publication is run on the >> > subscriber, the SUBSCRIPTIONRELMAP catalogue gets invalidated but the >> > corresponding cache entries in the LogicalRepRelMap which is used by >> > logicalrep_rel_open are not invalidated. LogicalRepRelMap is used to know >> > the relations that are associated with the subscription. But we seem to >> > have not taken care of invalidating those entries, though we have the >> > invalidation callback invalidate_syncing_table_states registered for >> > SUBSCRIPTIONRELMAP in ApplyWorkerMain. So, we miss to work on updating the >> > entries in LogicalRepRelMap. >> > >> > IMO, the ideal way to fix this issue is 1) stop the walsender sending the >> > changes to dropped tables, for this japin patch works 2) we must mark all >> > the LogicalRepRelMap entries as invalid in invalidate_syncing_table_states >> > so that in the next logicalrep_rel_open, if the entry is invalidated, then >> > we have to call GetSubscriptionRelState to get the latest state, as shown >> > in [1]. Likewise, we might also have to mark the cache entries invalid in >> > subscription_change_cb which is invalidation callback for pg_subscription >> > >> >> Is the second point you described here is related to the original >> bug-reported or will it cause some other problem? > > It can cause some other problems. I found this while investigating > from the subscriber perspective why the subscriber is accepting the > inserts even though the relation is removed from the > pg_subscription_rel catalogue after refresh publication. I ended up > finding that the cache entries are not being invalidated in > invalidate_syncing_table_states. > > Having said that, the patch proposed by japin is enough to solve the > bug reported here. > > While we are fixing the bug, I thought it's better to fix this as > well, maybe as a 0002 patch? If okay, I can work on the patch and post > it in a separate thread? > >> > Thoughts? >> > >> > [1] - >> > if (entry->state != SUBREL_STATE_READY || entry->invalid) >> > entry->state = GetSubscriptionRelState(MySubscription->oid, >> > entry->localreloid, >> > &entry->statelsn); >> > >> > if (entry->invalid) >> > entry->invalid = false; >> > >> > return entry; >> > >> > > I have another question on your patch which is why in some cases like >> > > when we have not inserted in step-5 (as mentioned by you) the >> > > following insertions are not sent. Is somehow we are setting the >> > > pubactions as false in that case, if so, how? >> > >> > The reason is that the issue reported in this thread occurs - when we have >> > the walsender process running, RelationSyncCache is initialized, we >> > inserted some data into the table that's sent to the subscriber and the >> > table is dropped, we miss to set the pubactions to false in >> > get_rel_sync_entry, though the cache entries have been invalidated. >> > >> > In some cases, it works properly because the old walsender process was >> > stopped and when a new walsender is started, then the cache >> > RelationSyncCache gets initialized again and the pubactions will be set to >> > false in get_rel_sync_entry. >> > >> >> Why is walsender process was getting stopped in one case but not in another? > > Both walsender and logical replication apply workers are getting > stopped because of the default values of 60s for wal_sender_timeout > and wal_receiver_timeout respectively. To analyze further, I increased > both timeouts to 600s. > > Here's what happening: > > Note that for the sake of simplicity, I'm skipping the create > publication, subscription, initial insertion statements, I'm starting > from alter publication ... drop table statement. > > case 1 - issue reported in this thread is observed: > 1) on publisher - alter publication ... drop table t1; and insert into > t1 values(67); Though the table is dropped from the publication, the > walsender process publishes the inserts, which will be fixed by the > patch proposed by japin. > > 2) on subscriber - on the insert into t1(1) on the publisher, > logicalrep_relmap_update gets called in the subscriber because the > publisher informs the subscriber that it has dropped the table from > the publication. In logicalrep_relmap_update, the LogicalRepRelMap > cache entry corresponding to the dropped relation is reset. Since the > alter subscription ... refresh publication is not yet run, the insert > is taken to the target table and the LogicalRepRelMap cache entry for > that table is updated in logicalrep_rel_open, it's state is read from > the catalogues using GetSubscriptionRelState, because of the entry > reset in logicalrep_relmap_update, entry->state is > SUBREL_STATE_UNKNOWN. After GetSubscriptionRelState, entry->state > becomes SUBREL_STATE_READY > if (entry->state != SUBREL_STATE_READY) > entry->state = GetSubscriptionRelState(MySubscription->oid, > entry->localreloid, > &entry->statelsn); > > 3) on subscriber - run alter subscription ... refresh publication, the > dropped relation entry is removed from the pg_subscription_rel > catalogue and invalidate_syncing_table_states gets called. But note > that we have not invalidated the LogicalRepRelMap. The > LogicalRepRelMap cache entry for the table remains what we have set in > (2) i.e. entry->state is SUBREL_STATE_READY. > > 4) on publisher - insert into t1 values(68); as mentioned in (1), > walsender process keeps sending the inserts. > > 5) on subscriber - apply_handle_insert is called and the > logicalrep_rel_open will not call GetSubscriptionRelState to get the > new catalogue entry for pg_subscription_rel (3), because the > entry->state is still SUBREL_STATE_READY which was set (2), so it ends > up in applying the incoming inserts, until the GetSubscriptionRelState > is called for that entry which happens either the apply worker stops > and restarts due to timeout or another change to the publication > happens in the publisher so that logicalrep_relmap_update is called on > the subscriber. If we had marked all the LogicalRepRelMap cache > entries as invalid in the invalidate_syncing_table_states callback as > pointed out by me in the earlier mail, then this problem will be > solved. > > case 2 - issue is not observed: > 1) on publisher - alter publication ... drop table t1; no inserts into > the table t1, subscriber will not receive logicalrep_relmap_update > yet. > > 2) on subscriber - run alter subscription ... refresh publication, the > dropped relation entry is removed from the pg_subscription_rel > catalogue and invalidate_syncing_table_states gets called. But note > that we have not invalidated the LogicalRepRelMap. The > LogicalRepRelMap cache entry for the table remains the same. > > 3) on publisher - insert into t1 values(67); Though the table is > dropped from the publication (1), the walsender process publishes the > inserts, which will be fixed by the patch proposed by japin. > > 4) on subscriber - on the insert into t1(3) on the publisher, > logicalrep_relmap_update gets called in the subscriber because the > publisher informs the subscriber that it has dropped the table from > the publication. In logicalrep_relmap_update, the LogicalRepRelMap > cache entry corresponding to the dropped relation is reset, because of > which the next logicalrep_rel_open will call GetSubscriptionRelState > (as entry->state is SUBREL_STATE_UNKNOWN due to the reset in > logicalrep_relmap_update). GetSubscriptionRelState will return > SUBREL_STATE_UNKNOWN, because it doesn't find the tuple in the updated > pg_subscription_rel catalogue because of the alter subscription ... > refresh publication would have removed tuple related to the table from > the pg_subscription_rel. So, the inserts are ignored because the > should_apply_changes_for_rel returns false after the > logicalrep_rel_open in apply_handle_insert. > /* Try finding the mapping. */ > tup = SearchSysCache2(SUBSCRIPTIONRELMAP, > ObjectIdGetDatum(relid), > ObjectIdGetDatum(subid)); > > if (!HeapTupleIsValid(tup)) > { > *sublsn = InvalidXLogRecPtr; > return SUBREL_STATE_UNKNOWN; > } > > 5) Here on, all further inserts into the publication table are ignored > by the subscriber because of the same reason (4). So, the issue > reported in this thread doesn't occur. > > I'm sorry for the huge write up. I hope I clarified the point this time. >
Yes! Very clearly. > In summary, I feel we need to fix the publisher sending the inserts > even though the table is dropped from the publication, that is the > patch patch proposed by japin. This solves the bug reported in this > thread. > > And also, it's good to have the LogicalRepRelMap invalidation fix as a > 0002 patch in invalidate_syncing_table_states, subscription_change_cb > and logicalrep_rel_open as proposed by me. > > Thoughts? > I think invalidate the LogicalRepRelMap is necessary. If the table isn't in subscription, can we remove the LogicalRepRelMapEntry from LogicalRepRelMap? -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.