Hi Shaheed,

I think you must already have analyzed the output of queries on pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription, etc. One query I could find for sizing the gap is:

SELECT pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>', '<subscriber_replayed_lsn>'));
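If you are collecting LSNs from both ends in a script anyway, the arithmetic behind pg_wal_lsn_diff() is easy to reproduce client-side. A minimal sketch in Python (the helper names are my own, and the pretty-printer only roughly approximates pg_size_pretty's rounding):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position.

    The text form is two hex numbers: the high 32 bits and the low 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_diff(a: str, b: str) -> int:
    """Byte distance from LSN b up to LSN a (mirrors pg_wal_lsn_diff(a, b))."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)


def pretty(n: int) -> str:
    """Rough stand-in for pg_size_pretty; assumes a non-negative byte count."""
    for unit in ("bytes", "kB", "MB", "GB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} TB"
```

For example, pretty(lsn_diff(publisher_lsn, subscriber_lsn)) gives you the lag in bytes without a round trip to the server.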
As a side note, if you want to see what has been applied on the subscriber versus what exists on the publisher, here is something from my previous experience: we used to have a data-validation tool for checking tables/rows across publisher and subscriber. We also used pg_dump in another tool that was meant for making copies of schemas.

Regards,
Muhammad Ikram

On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedha...@gmail.com> wrote:

> Hi Muhammad,
>
> On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmik...@gmail.com> wrote:
>
>> Hi Shaheed,
>>
>> Maybe these considerations could help you or give a hint about the problem?
>>
>> Check whether wal_receiver_timeout being set to 0 could cause issues, such as not detecting network problems quickly enough. Consider re-evaluating this setting if you see connection issues.
>>
>> If you notice that some data is missing on the subscriber, you could increase max_slot_wal_keep_size on the publisher so that WAL segments are not deleted until they have been applied on the subscriber.
>>
>> Do you have the flexibility to increase max_worker_processes, max_logical_replication_workers, work_mem and maintenance_work_mem on the subscriber (in case the bottleneck is on the subscriber)?
>>
>> If there is significant lag, consider whether it might be more efficient to drop the subscription and re-initialize it from scratch using a new base backup, depending on the data volume and how long the existing replication might take to catch up.
>
> Thanks for the kind hints, I'll certainly look into those.
>
> My main interest, however, was the "visibility" question, i.e. getting an understanding of the gap between the two ends of a replication slot, ideally in human terms (e.g. tables x records).
> I understand the difficulties of trying to produce a meaningful metric that spans two (or more) systems, but let's be honest: trying to diagnose which knobs to tweak (whether in the application, PG, the OS or the network) is basically black magic when all we really have is a pair of opaque LSNs.
>
>> Regards,
>> Muhammad Ikram
>>
>> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedha...@gmail.com> wrote:
>>
>>> Since nobody more knowledgeable has replied...
>>>
>>> I'm very interested in this area and still surprised that there is no official/convenient/standard way to approach this (see https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com).
>>>
>>> Based partly on that thread, I ended up with a script that connects to both ends of the replication and basically loops while comparing the counts in each table.
>>>
>>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiew...@ghx.com> wrote:
>>>
>>>> I've got two Postgres 13 databases on AWS RDS.
>>>>
>>>> - One is a master, the other a slave using logical replication.
>>>> - Replication has fallen behind by about 350 GB.
>>>> - The slave was maxed out in terms of CPU for the past four days because of some ongoing jobs, so I'm not sure what logical replication managed to replicate during that time.
>>>> - I killed those jobs and now CPU on both the master and the slave is low.
>>>> - Looking at the subscriber via `select * from pg_stat_subscription;`, I see that latest_end_lsn is advancing, albeit very slowly.
>>>> - The publisher says the write/flush/replay lags are all 13 minutes behind, but it has been like that for most of the day.
>>>> - I see no errors in the logs on either the publisher or the subscriber, apart from some simple SQL errors that users have been making.
>>>> - CloudWatch reports low CPU utilization, low I/O, and low network.
>>>> Is there anything I can do here? Previously I set wal_receiver_timeout to 0 because I had replication issues, and that helped things. I wish I had *some* visibility here, to get any kind of confidence that it's going to pull through, but other than these LSN values and the database logs, I'm not sure what to check.
>>>>
>>>> Sincerely,
>>>> mj
>>
>> --
>> Muhammad Ikram

--
Muhammad Ikram
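P.S. On the count-comparison approach Shaheed describes above: once you have fetched per-table row counts from each end (via count(*), or the cheaper n_live_tup estimate in pg_stat_user_tables, using whichever driver you prefer), the comparison step itself is trivial. A hypothetical sketch in Python (the function name and report shape are my own invention, not from Shaheed's script):

```python
def compare_counts(
    publisher: dict[str, int], subscriber: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {table: (publisher_count, subscriber_count)} for every table
    whose counts differ, treating a table missing on one side as count 0."""
    mismatches = {}
    for table in sorted(set(publisher) | set(subscriber)):
        p = publisher.get(table, 0)
        s = subscriber.get(table, 0)
        if p != s:
            mismatches[table] = (p, s)
    return mismatches
```

Looping on this until the mismatch set is empty (or stops shrinking) gives roughly the "tables x records" view of the gap that was asked for.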