On Wed, Oct 27, 2021 at 7:02 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Thu, Oct 21, 2021 at 10:29 AM Masahiko Sawada <sawada.m...@gmail.com> > wrote: > > > > > > I've attached updated patches.
Thank you for the comments! > > Few comments: > ============== > 1. Is the patch cleaning tablesync error entries except via vacuum? If > not, can't we send a message to remove tablesync errors once tablesync > is successful (say when we reset skip_xid or when tablesync is > finished) or when we drop subscription? I think the same applies to > apply worker. I think we may want to track it in some way whether an > error has occurred before sending the message but relying completely > on a vacuum might be the recipe of bloat. I think in the case of a > drop subscription we can simply send the message as that is not a > frequent operation. I might be missing something here because in the > tests after drop subscription you are expecting the entries from the > view to get cleared Yes, I think we can have tablesync worker send a message to drop stats once tablesync is successful. But if we do that also when dropping a subscription, I think we need to do that only the transaction is committed since we can drop a subscription that doesn't have a replication slot and rollback the transaction. Probably we can send the message only when the subscritpion does have a replication slot. In other cases, we can remember the subscriptions being dropped and send the message to drop the statistics of them after committing the transaction but I’m not sure it’s worth having it. FWIW, we completely rely on pg_stat_vacuum_stats() for cleaning up the dead tables and functions. And we don't expect there are many subscriptions on the database. What do you think? > > 2. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>count</structfield> <type>uint8</type> > + </para> > + <para> > + Number of consecutive times the error occurred > + </para></entry> > > Shall we name this field as error_count as there will be other fields > in this view in the future that may not be directly related to the > error? Agreed. > > 3. > + > +CREATE VIEW pg_stat_subscription_workers AS > + SELECT > + e.subid, > + s.subname, > + e.subrelid, > + e.relid, > + e.command, > + e.xid, > + e.count, > + e.error_message, > + e.last_error_time, > + e.stats_reset > + FROM (SELECT > + oid as subid, > + NULL as relid > + FROM pg_subscription > + UNION ALL > + SELECT > + srsubid as subid, > + srrelid as relid > + FROM pg_subscription_rel > + WHERE srsubstate <> 'r') sr, > + LATERAL pg_stat_get_subscription_worker(sr.subid, sr.relid) e > > It might be better to use 'w' as an alias instead of 'e' as the > information is now not restricted to only errors. Agreed. > > 4. +# Test if the error reported on pg_subscription_workers view is expected. > > The view name is wrong in the above comment Fixed. > > 5. > +# Check if the view doesn't show any entries after dropping the > subscriptions. > +$node_subscriber->safe_psql( > + 'postgres', > + q[ > +DROP SUBSCRIPTION tap_sub; > +DROP SUBSCRIPTION tap_sub_streaming; > +]); > +$result = $node_subscriber->safe_psql('postgres', > + "SELECT count(1) FROM pg_stat_subscription_workers"); > +is($result, q(0), 'no error after dropping subscription'); > > Don't we need to wait after dropping the subscription and before > checking the view as there might be a slight delay in messages to get > cleared? I think the test always passes without waiting for the statistics to be updated since we fetch the subscription worker statistics from the stats collector based on the entries of pg_subscription catalog. So this test checks if statistics of already-dropped subscription doesn’t show up in the view after DROP SUBSCRIPTION, but does not check if the subscription worker statistics entry in the stats collector gets removed. The primary reason is that as I mentioned above, the patch relies on pgstat_vacuum_stat() for cleaning up the dead subscriptions. > > 7. > +# Create subscriptions. The table sync for test_tab2 on tap_sub will enter to > +# infinite error due to violating the unique constraint. > +my $appname = 'tap_sub'; > +$node_subscriber->safe_psql( > + 'postgres', > + "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr > application_name=$appname' PUBLICATION tap_pub WITH (streaming = off, > two_phase = on);"); > +my $appname_streaming = 'tap_sub_streaming'; > +$node_subscriber->safe_psql( > + 'postgres', > + "CREATE SUBSCRIPTION tap_sub_streaming CONNECTION > '$publisher_connstr application_name=$appname_streaming' PUBLICATION > tap_pub_streaming WITH (streaming = on, two_phase = on);"); > + > +$node_publisher->wait_for_catchup($appname); > +$node_publisher->wait_for_catchup($appname_streaming); > > How can we ensure that subscriber would have caught up when one of the > tablesync workers is constantly in the error loop? Isn't it possible > that the subscriber didn't send the latest lsn feedback till the table > sync worker is finished? > I thought that even if tablesync for a table is still ongoing, the apply worker can apply commit records, update write LSN and flush LSN, and send the feedback to the wal sender. No? > 8. > +# Create subscriptions. The table sync for test_tab2 on tap_sub will enter to > +# infinite error due to violating the unique constraint. > > The second sentence of the comment can be written as: "The table sync > for test_tab2 on tap_sub will enter into infinite error loop due to > violating the unique constraint." Fixed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/