Hi Muhammad,

On Mon, 2 Sep 2024, 09:45 Muhammad Ikram, <mmik...@gmail.com> wrote:
> Hi Shaheed,
> I think you must have already analyzed the outcome of queries on
> pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. I
> could find a query SELECT
> pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
> '<subscriber_replayed_lsn>'));
>

Yes. My point is that it is hard to go from byte numbers to table entries.

> As a side note, if you want to see what has been applied to subscribers vs
> what exists on the publisher, then here is something from my previous
> experience. We used to have a Data Validation tool for checking tables/rows
> across publisher/subscriber.
>

Ack. That's pretty much what I had to build.

> We also used pg_dump for another tool that was meant for making copies of
> schemas.
>

I'm somewhat fortunate to have a simple use case where all I am doing is
copying the "old" deployment to a "new" deployment, such that when the two
ends are in close sync, I can freeze traffic to the old deployment, pause
for any final catch-up, and then run a Django migration on the new before
switching it on (thereby minimising the downtime for the app).

What I found by just looking at LSN numbers was that the database LSNs were
close but NOT the same. Once I built the tool, I was able to see which
tables were still in play, and saw that some previously overlooked
background timers were expiring, causing the activity. Net result: the LSNs
can tell you that you are not in sync, but not why. (Again, I understand
that row counts worked for me, but might not work for others.) Rough
sketches of the kind of LSN check and row-count loop I mean are appended at
the end of this mail, below the quoted thread.

Thanks for your kind help and pointers!

> Regards,
> Muhammad Ikram
>
>
> On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedha...@gmail.com>
> wrote:
>
>> Hi Muhammad,
>>
>> On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmik...@gmail.com> wrote:
>>
>>> Hi Shaheed,
>>>
>>> Maybe these considerations could help you or give any hint to the
>>> problem?
>>>
>>> Check if wal_receiver_timeout being set to 0 could potentially cause
>>> issues, like not detecting network issues quickly enough. Consider
>>> re-evaluating this setting if you see connection issues.
>>>
>>> If you notice that some data is missing on the subscriber, then could
>>> you increase max_slot_wal_keep_size on the publisher so that WALs are
>>> not deleted until they are applied on the subscriber.
>>>
>>> Do you have flexibility to increase max_worker_processes and
>>> max_logical_replication_workers, work_mem and maintenance_work_mem on
>>> the subscriber (in case the bottleneck exists on the subscriber)?
>>>
>>> If there's significant lag, consider whether it might be more efficient
>>> to drop the subscription and re-initialize it from scratch using a new
>>> base backup, depending on the data volume and how long it might take
>>> for the existing replication to catch up.
>>>
>>
>> Thanks for the kind hints, I'll certainly look into those.
>>
>> My main interest, however, was with the "visibility" question, i.e. to
>> get an understanding of the gap between the two ends of a replication
>> slot, ideally in human terms (e.g. tables x records).
>>
>> I understand the difficulties of trying to produce a meaningful metric
>> that spans two (or more) systems but, let's be honest, trying to diagnose
>> which knobs to tweak (whether in the application, PG, the OS or the
>> network) is basically black magic when all we really have is a pair of
>> opaque LSNs.
>>
>>
>>> Regards,
>>> Muhammad Ikram
>>>
>>> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedha...@gmail.com>
>>> wrote:
>>>
>>>> Since nobody more knowledgeable has replied...
>>>>
>>>> I'm very interested in this area and still surprised that there is no
>>>> official/convenient/standard way to approach this (see
>>>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>>>> ).
>>>>
>>>> Based partly on that thread, I ended up with a script that connects to
>>>> both ends of the replication, and basically loops while comparing the
>>>> counts in each table.
>>>>
>>>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiew...@ghx.com>
>>>> wrote:
>>>>
>>>>> I've got two Postgres 13 databases on AWS RDS.
>>>>>
>>>>> - One is a master, the other a slave using logical replication.
>>>>> - Replication has fallen behind by about 350 GB.
>>>>> - The slave was maxed out in terms of CPU for the past four days
>>>>>   because of some jobs that were ongoing, so I'm not sure what logical
>>>>>   replication was able to replicate during that time.
>>>>> - I killed those jobs and now CPU on the master and slave are both low.
>>>>> - I look at the subscriber via `select * from pg_stat_subscription;`
>>>>>   and see that latest_end_lsn is advancing, albeit very slowly.
>>>>> - The publisher says write/flush/replay lags are all 13 minutes behind,
>>>>>   but it's been like that for most of the day.
>>>>> - I see no errors in the logs on either the publisher or subscriber,
>>>>>   outside of some simple SQL errors that users have been making.
>>>>> - CloudWatch reports low CPU utilization, low I/O, and low network.
>>>>>
>>>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>>>> to 0 because I had replication issues, and that helped things. I wish
>>>>> I had *some* visibility here to get any kind of confidence that it's
>>>>> going to pull through, but other than these LSN values and database
>>>>> logs, I'm not sure what to check.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> mj
>>>
>>> --
>>> Muhammad Ikram
>
> --
> Muhammad Ikram
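
P.S. For anyone reaching this thread via the archives, here is roughly the
sort of publisher-side check referred to above. It is only a sketch, not a
polished tool: the DSN is a placeholder, it assumes psycopg2 is installed
and that the connecting role can read pg_replication_slots, and it only
looks at logical slots. All it does is report, per slot, how far the
consumer trails the publisher's current WAL position, in bytes.

    # lsn_gap.py: rough sketch only. Reports, per logical slot, how far the
    # consumer lags behind the publisher's current WAL write position.
    # Assumes psycopg2 is installed; the DSN below is a placeholder.
    import psycopg2

    PUBLISHER_DSN = "host=old-deployment dbname=appdb user=monitor"  # placeholder

    LAG_SQL = """
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_gap,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS restart_gap
      FROM pg_replication_slots
     WHERE slot_type = 'logical'
    """

    conn = psycopg2.connect(PUBLISHER_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(LAG_SQL)
            for slot_name, active, confirmed_gap, restart_gap in cur.fetchall():
                print(f"{slot_name}: active={active}, "
                      f"confirmed_flush gap={confirmed_gap}, restart gap={restart_gap}")
    finally:
        conn.close()

As noted above, this only gives the size of the gap in bytes; it says
nothing about which tables the outstanding changes belong to.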
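
And this is the general shape of the row-count comparison loop mentioned
above. To be clear, it is a minimal sketch of the idea rather than the
actual script: the DSNs, schema name and poll interval are placeholders,
count(*) is only sensible for modest table sizes, and the two ends are not
sampled in a single consistent snapshot, so equal counts are a strong hint
of convergence rather than a proof.

    # compare_counts.py: rough sketch only. Polls both ends of a logical
    # replication pair and reports tables whose row counts differ.
    # Assumes psycopg2; DSNs, schema and poll interval are placeholders.
    import time
    import psycopg2

    PUBLISHER_DSN = "host=old-deployment dbname=appdb user=monitor"   # placeholder
    SUBSCRIBER_DSN = "host=new-deployment dbname=appdb user=monitor"  # placeholder
    SCHEMA = "public"      # placeholder
    POLL_SECONDS = 10      # placeholder

    def table_counts(dsn):
        """Return {table_name: row_count} for ordinary tables in SCHEMA."""
        counts = {}
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT tablename FROM pg_tables "
                    "WHERE schemaname = %s ORDER BY tablename",
                    (SCHEMA,),
                )
                tables = [row[0] for row in cur.fetchall()]
                for table in tables:
                    # Exact count: fine for modest tables, slow for huge ones.
                    # Names come straight from the catalog, hence the direct quoting.
                    cur.execute(f'SELECT count(*) FROM "{SCHEMA}"."{table}"')
                    counts[table] = cur.fetchone()[0]
        finally:
            conn.close()
        return counts

    while True:
        pub = table_counts(PUBLISHER_DSN)
        sub = table_counts(SUBSCRIBER_DSN)
        drifting = {
            name: (pub.get(name), sub.get(name))
            for name in sorted(set(pub) | set(sub))
            if pub.get(name) != sub.get(name)
        }
        if not drifting:
            print("All table counts match.")
            break
        for name, (p, s) in drifting.items():
            print(f"{name}: publisher={p} subscriber={s}")
        time.sleep(POLL_SECONDS)

Watching which tables kept reappearing in the drifting set over successive
iterations is what pointed at the overlooked background timers mentioned
above.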