Hi Shaheed,

Maybe these considerations could help, or at least give a hint about the problem.
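Before going through the suggestions further down, it may help to quantify where
the lag actually is on both sides. A minimal sketch, using only the stock
statistics views in Postgres 13 (nothing here is specific to your setup):

  -- On the publisher: how far each walsender is behind, in bytes and in time.
  SELECT application_name,
         state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;

  -- On the publisher: how much WAL each slot is holding back, and whether the
  -- slot is still safe (wal_status / safe_wal_size exist from PG 13 on).
  SELECT slot_name,
         active,
         wal_status,
         safe_wal_size,
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS retained_bytes
  FROM pg_replication_slots;

  -- On the subscriber: what the apply worker has received and confirmed so far.
  SELECT subname,
         received_lsn,
         latest_end_lsn,
         latest_end_time,
         last_msg_receipt_time
  FROM pg_stat_subscription;

If retained_bytes keeps shrinking, the subscriber is catching up even when it
feels slow; if it keeps growing, the apply side is the bottleneck.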
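If any of the settings mentioned below turn out to need adjusting: on RDS they
are changed through the DB parameter group rather than ALTER SYSTEM, but the
plain-Postgres form may still help to make the names and units concrete. The
values here are placeholders, not recommendations:

  -- Publisher: cap how much WAL a lagging slot may retain (placeholder value;
  -- too small a cap can invalidate the slot, too large can fill the disk).
  ALTER SYSTEM SET max_slot_wal_keep_size = '400GB';

  -- Subscriber: more headroom for apply workers and for sort/index work.
  -- max_worker_processes and max_logical_replication_workers need a restart;
  -- the memory settings only need a reload.
  ALTER SYSTEM SET max_worker_processes = 16;
  ALTER SYSTEM SET max_logical_replication_workers = 8;
  ALTER SYSTEM SET work_mem = '64MB';
  ALTER SYSTEM SET maintenance_work_mem = '1GB';

  -- Subscriber: a finite receiver timeout so a dead connection is noticed.
  ALTER SYSTEM SET wal_receiver_timeout = '1min';

  SELECT pg_reload_conf();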
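And if re-initialising from scratch (last point below) ends up being the faster
route, the basic shape on the subscriber is roughly the following. The
subscription, publication and connection details are placeholders, and with
~350 GB of backlog the initial copy itself can take a long while, so it is
worth weighing against simply letting the existing replication catch up:

  -- On the subscriber: drop the old subscription (this also removes its
  -- replication slot on the publisher unless the slot was detached first).
  DROP SUBSCRIPTION my_sub;

  -- Empty the target tables, then re-create the subscription so that a
  -- fresh initial table sync (copy_data) is performed.
  CREATE SUBSCRIPTION my_sub
      CONNECTION 'host=publisher.example.internal dbname=app user=repl_user'
      PUBLICATION my_pub
      WITH (copy_data = true);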
- Check whether wal_receiver_timeout being set to 0 could be hiding problems:
  with the timeout disabled, the subscriber may not notice a broken connection
  quickly. Consider re-evaluating this setting if you keep seeing connection
  issues.

- If you notice that some data is missing on the subscriber, consider
  increasing max_slot_wal_keep_size on the publisher so that WAL is not
  removed before it has been applied on the subscriber.

- Do you have the flexibility to increase max_worker_processes,
  max_logical_replication_workers, work_mem and maintenance_work_mem on the
  subscriber, in case the bottleneck is on the subscriber side?

- If there is significant lag, consider whether it might be more efficient to
  drop the subscription and re-initialise it from scratch with a fresh initial
  copy of the data, depending on the data volume and how long the existing
  replication would take to catch up.

Regards,
Muhammad Ikram

On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedha...@gmail.com> wrote:

> Since nobody more knowledgeable has replied...
>
> I'm very interested in this area and still surprised that there is no
> official/convenient/standard way to approach this (see
> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
> ).
>
> Based partly on that thread, I ended up with a script that connects to
> both ends of the replication and basically loops while comparing the
> counts in each table.
>
> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiew...@ghx.com> wrote:
>
>> I've got two Postgres 13 databases on AWS RDS.
>>
>> - One is a master, the other a slave using logical replication.
>> - Replication has fallen behind by about 350 GB.
>> - The slave was maxed out in terms of CPU for the past four days
>>   because of some jobs that were ongoing, so I'm not sure what logical
>>   replication was able to replicate during that time.
>> - I killed those jobs and now CPU on the master and the slave are both
>>   low.
>> - I look at the subscriber via `select * from pg_stat_subscription;`
>>   and see that latest_end_lsn is advancing, albeit very slowly.
>> - The publisher says the write/flush/replay lags are all 13 minutes
>>   behind, but it's been like that for most of the day.
>> - I see no errors in the logs on either the publisher or the subscriber,
>>   outside of some simple SQL errors that users have been making.
>> - CloudWatch reports low CPU utilization, low I/O, and low network.
>>
>> Is there anything I can do here? Previously I set wal_receiver_timeout
>> to 0 because I had replication issues, and that helped things. I wish I
>> had *some* visibility here to get any kind of confidence that it's going
>> to pull through, but other than these LSN values and database logs, I'm
>> not sure what to check.
>>
>> Sincerely,
>>
>> mj

--
Muhammad Ikram