On Wed, 13 Jan 2021 at 16:49, Bharath Rupireddy wrote:
> On Wed, Jan 13, 2021 at 11:27 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> On Wed, Jan 13, 2021 at 11:08 AM Bharath Rupireddy
>> <bharath.rupireddyforpostg...@gmail.com> wrote:
>> >
>> > On Wed, Jan 13, 2021 at 10:33 AM Amit Kapila <amit.kapil...@gmail.com> 
>> > wrote:
>> > >
>> > > On Tue, Jan 12, 2021 at 5:23 PM japin <japi...@hotmail.com> wrote:
>> > > >
>> > > > On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote:
>> > > > > On Tue, Jan 12, 2021 at 4:47 PM Li Japin <japi...@hotmail.com> wrote:
>> > > > >> IIUC the logical replication only replicate the tables in 
>> > > > >> publication, I think
>> > > > >> when the tables that aren't in publication should not be replicated.
>> > > > >>
>> > > > >> Attached the patch that fixes it.  Thought?
>> > > > >
>> > > > > With that change, we don't get the behaviour that's stated in the
>> > > > > document - "The ADD TABLE and DROP TABLE clauses will add and remove
>> > > > > one or more tables from the publication. Note that adding tables to a
>> > > > > publication that is already subscribed to will require a ALTER
>> > > > > SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side 
>> > > > > in
>> > > > > order to become effective" -
>> > > > > https://www.postgresql.org/docs/devel/sql-alterpublication.html.
>> > > > >
>> > > >
>> > > > The documentation only emphasize adding tables to a publication, not
>> > > > include dropping tables from a publication.
>> > > >
>> > >
>> > > Right and I think that is ensured by the subscriber by calling
>> > > should_apply_changes_for_rel() which won't return true unless the
>> > > newly added relation is not synced via Refresh Publication. So, this
>> > > means with or without this patch we should be sending the changes of
>> > > the newly published table from the publisher?
>> >
>> > Oh, my bad, alter subscription...refresh publication is required only when 
>> > the tables are added to the publisher. Patch by japin makes the walsender 
>> > process to stop sending the data to the subscriber/apply worker. The patch 
>> > is based on the idea of looking at the PUBLICATIONRELMAP in 
>> > get_rel_sync_entry when the entries have been invalidated in 
>> > rel_sync_cache_publication_cb because of alter publication...drop table.
>> > ,
>> > When the alter subscription...refresh publication is run on the 
>> > subscriber, the SUBSCRIPTIONRELMAP catalogue gets invalidated but the 
>> > corresponding cache entries in the LogicalRepRelMap which is used by 
>> > logicalrep_rel_open are not invalidated. LogicalRepRelMap is used to know 
>> > the relations that are associated with the subscription. But we seem to 
>> > have not taken care of invalidating those entries, though we have the 
>> > invalidation callback invalidate_syncing_table_states registered for 
>> > SUBSCRIPTIONRELMAP in ApplyWorkerMain. So, we miss to work on updating the 
>> > entries in LogicalRepRelMap.
>> >
>> > IMO, the ideal way to fix this issue is 1) stop the walsender sending the 
>> > changes to dropped tables, for this japin patch works 2) we must mark all 
>> > the LogicalRepRelMap entries as invalid in invalidate_syncing_table_states 
>> > so that in the next logicalrep_rel_open, if the entry is invalidated, then 
>> > we have to call GetSubscriptionRelState to get the latest state, as shown 
>> > in [1]. Likewise, we might also have to mark the cache entries invalid in  
>> > subscription_change_cb which is invalidation callback for pg_subscription
>> >
>>
>> Is the second point you described here is related to the original
>> bug-reported or will it cause some other problem?
>
> It can cause some other problems. I found this while investigating
> from the subscriber perspective why the subscriber is accepting the
> inserts even though the relation is removed from the
> pg_subscription_rel catalogue after refresh publication. I ended up
> finding that the cache entries are not being invalidated in
> invalidate_syncing_table_states.
>
> Having said that, the patch proposed by japin is enough to solve the
> bug reported here.
>
> While we are fixing the bug, I thought it's better to fix this as
> well, maybe as a 0002 patch? If okay, I can work on the patch and post
> it in a separate thread?
>
>> > Thoughts?
>> >
>> > [1] -
>> >     if (entry->state != SUBREL_STATE_READY || entry->invalid)
>> >         entry->state = GetSubscriptionRelState(MySubscription->oid,
>> >                                                entry->localreloid,
>> >                                                &entry->statelsn);
>> >
>> >    if (entry->invalid)
>> >        entry->invalid = false;
>> >
>> >     return entry;
>> >
>> > > I have another question on your patch which is why in some cases like
>> > > when we have not inserted in step-5 (as mentioned by you) the
>> > > following insertions are not sent. Is somehow we are setting the
>> > > pubactions as false in that case, if so, how?
>> >
>> > The reason is that the issue reported in this thread occurs - when we have 
>> > the walsender process running, RelationSyncCache is initialized, we 
>> > inserted some data into the table that's sent to the subscriber and the 
>> > table is dropped, we miss to set the pubactions to false in 
>> > get_rel_sync_entry, though the cache entries have been invalidated.
>> >
>> > In some cases, it works properly because the old walsender process was 
>> > stopped and when a new walsender is started, then the cache 
>> > RelationSyncCache gets initialized again and the pubactions will be set to 
>> > false in get_rel_sync_entry.
>> >
>>
>> Why is walsender process was getting stopped in one case but not in another?
>
> Both walsender and logical replication apply workers are getting
> stopped because of the default values of 60s for wal_sender_timeout
> and wal_receiver_timeout respectively. To analyze further, I increased
> both timeouts to 600s.
>
> Here's what happening:
>
> Note that for the sake of simplicity, I'm skipping the create
> publication, subscription, initial insertion statements, I'm starting
> from alter publication ... drop table statement.
>
> case 1 - issue reported in this thread is observed:
> 1) on publisher - alter publication ... drop table t1; and insert into
> t1 values(67);   Though the table is dropped from the publication, the
> walsender process publishes the inserts, which will be fixed by the
> patch proposed by japin.
>
> 2) on subscriber - on the insert into t1(1) on the publisher,
> logicalrep_relmap_update gets called in the subscriber because the
> publisher informs the subscriber that it has dropped the table from
> the publication. In logicalrep_relmap_update, the LogicalRepRelMap
> cache entry corresponding to the dropped relation is reset. Since the
> alter subscription ... refresh publication is not yet run, the insert
> is taken to the target table and the LogicalRepRelMap cache entry for
> that table is updated in logicalrep_rel_open, it's state is read from
> the catalogues using GetSubscriptionRelState, because of the entry
> reset in logicalrep_relmap_update, entry->state is
> SUBREL_STATE_UNKNOWN. After GetSubscriptionRelState, entry->state
> becomes SUBREL_STATE_READY
>     if (entry->state != SUBREL_STATE_READY)
>         entry->state = GetSubscriptionRelState(MySubscription->oid,
>                                                entry->localreloid,
>                                                &entry->statelsn);
>
> 3) on subscriber - run alter subscription ... refresh publication, the
> dropped relation entry is removed from the pg_subscription_rel
> catalogue and invalidate_syncing_table_states gets called. But note
> that we have not invalidated the LogicalRepRelMap. The
> LogicalRepRelMap cache entry for the table remains what we have set in
> (2) i.e. entry->state is SUBREL_STATE_READY.
>
> 4) on publisher - insert into t1 values(68); as mentioned in (1),
> walsender process keeps sending the inserts.
>
> 5) on subscriber - apply_handle_insert is called and the
> logicalrep_rel_open will not call GetSubscriptionRelState to get the
> new catalogue entry for pg_subscription_rel (3), because the
> entry->state is still SUBREL_STATE_READY which was set (2), so it ends
> up in applying the incoming inserts, until the GetSubscriptionRelState
> is called for that entry which happens either the apply worker stops
> and restarts due to timeout or another change to the publication
> happens in the publisher so that logicalrep_relmap_update is called on
> the subscriber. If we had marked all the LogicalRepRelMap cache
> entries as invalid in the invalidate_syncing_table_states callback as
> pointed out by me in the earlier mail, then this problem will be
> solved.
>
> case 2 - issue is not observed:
> 1) on publisher - alter publication ... drop table t1; no inserts into
> the table t1, subscriber will not receive logicalrep_relmap_update
> yet.
>
> 2) on subscriber - run alter subscription ... refresh publication, the
> dropped relation entry is removed from the pg_subscription_rel
> catalogue and invalidate_syncing_table_states gets called. But note
> that we have not invalidated the LogicalRepRelMap. The
> LogicalRepRelMap cache entry for the table remains the same.
>
> 3) on publisher - insert into t1 values(67);   Though the table is
> dropped from the publication (1), the walsender process publishes the
> inserts, which will be fixed by the patch proposed by japin.
>
> 4) on subscriber - on the insert into t1(3) on the publisher,
> logicalrep_relmap_update gets called in the subscriber because the
> publisher informs the subscriber that it has dropped the table from
> the publication. In logicalrep_relmap_update, the LogicalRepRelMap
> cache entry corresponding to the dropped relation is reset, because of
> which the next logicalrep_rel_open will call GetSubscriptionRelState
> (as entry->state is SUBREL_STATE_UNKNOWN due to the reset in
> logicalrep_relmap_update). GetSubscriptionRelState will return
> SUBREL_STATE_UNKNOWN, because it doesn't find the tuple in the updated
> pg_subscription_rel catalogue because of the alter subscription ...
> refresh publication would have removed tuple related to the table from
> the pg_subscription_rel. So, the inserts are ignored because the
> should_apply_changes_for_rel returns false after the
> logicalrep_rel_open in apply_handle_insert.
>     /* Try finding the mapping. */
>     tup = SearchSysCache2(SUBSCRIPTIONRELMAP,
>                           ObjectIdGetDatum(relid),
>                           ObjectIdGetDatum(subid));
>
>     if (!HeapTupleIsValid(tup))
>     {
>         *sublsn = InvalidXLogRecPtr;
>         return SUBREL_STATE_UNKNOWN;
>     }
>
> 5) Here on, all further inserts into the publication table are ignored
> by the subscriber because of the same reason (4). So, the issue
> reported in this thread doesn't occur.
>
> I'm sorry for the huge write up. I hope I clarified the point this time.
>

Yes! Very clearly.

> In summary, I feel we need to fix the publisher sending the inserts
> even though the table is dropped from the publication, that is the
> patch patch proposed by japin. This solves the bug reported in this
> thread.
>
> And also, it's good to have the LogicalRepRelMap invalidation fix as a
> 0002 patch in invalidate_syncing_table_states, subscription_change_cb
> and logicalrep_rel_open as proposed by me.
>
> Thoughts?
>

I think invalidate the LogicalRepRelMap is necessary.  If the table isn't in
subscription, can we remove the LogicalRepRelMapEntry from LogicalRepRelMap?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.


Reply via email to