Since nobody more knowledgeable has replied... I'm very interested in this area and still surprised that there is no official/convenient/standard way to approach this (see https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com ).
Based partly on that thread, I ended up with a script that connects to both ends of the replication, and basically loops while comparing the counts in each table. On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiew...@ghx.com> wrote: > I've got two Postgres 13 databases on AWS RDS. > > - One is a master, the other a slave using logical replication. > - Replication has fallen behind by about 350Gb. > - The slave was maxed out in terms of CPU for the past four days > because of some jobs that were ongoing so I'm not sure what logical > replication was able to replicate during that time. > - I killed those jobs and now CPU on the master and slave are both low. > - I look at the subscriber via `select * from pg_stat_subscription;` > and see that latest_end_lsn is advancing albeit very slowly. > - The publisher says write/flush/replay lags are all 13 minutes behind > but it's been like that for most of the day. > - I see no errors in the logs on either the publisher or subscriber > outside of some simple SQL errors that users have been making. > - CloudWatch reports low CPU utilization, low I/O, and low network. > > > > Is there anything I can do here? Previously I set wal_receiver_timeout > timeout to 0 because I had replication issues, and that helped things. I > wish I had *some* visibility here to get any kind of confidence that it's > going to pull through, but other than these lsn values and database logs, > I'm not sure what to check. > > > > Sincerely, > > mj >