Hi Muhammad,

On Mon, 2 Sep 2024, 09:45 Muhammad Ikram, <mmik...@gmail.com> wrote:

> Hi Shaheed,
> I think you must have already analyzed the outcome of queries
> on pg_replication_slots,  pg_current_wal_lsn(), pg_stat_subscription etc. I
> could find a query SELECT
> pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
> '<subscriber_replayed_lsn>'));
>

Yes. My point is that it is hard to go from byte numbers to table entries.

Aps a side note if you want to see what has been applied to subscribers vs
> what exists on publisher then here is something from my previous
> experience. We used to have a Data Validation tool for checking tables/rows
> across publisher/subscriber.
>

Ack. That's pretty much what I had to build.

We also used pg_dump for another tool that was meant for making copies of
> schemas.
>

I'm somewhat fortunate to have a simple use case where all I am doing is a
copy of the "old" deployment to a "new" deployment such that when the two
ends are in close sync, I can freeze traffic to the old deployment, pause
for any final catchup, and then run a Django migration on the new, before
switching on the new (thereby minimising the down time for the app).

What I found by just looking at LSN numbers was that the database LSN were
close but NOT the same. Once I built the tool, I was able to see which
tables were still in play, and saw that some previously overlooked
background timers were expiring, causing the activity.

Net result: the LSNs can tell you if you are not in sync, but not the
reason why. (Again, I understand that row counts worked for me, but might
not work for others).

Thanks for your kind help and pointers!


Regards,
> Muhammad Ikram
>
>
>
> On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedha...@gmail.com>
> wrote:
>
>> Hi Muhammad,
>>
>> On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmik...@gmail.com> wrote:
>>
>>> Hi Shaheed,
>>>
>>> Maybe these considerations could help you or give any hint to the
>>> problem ?
>>>
>>>
>>> Check if wal_receiver_timeout being set to 0 could potentially cause
>>> issues, like not detecting network issues quickly enough. Consider
>>> re-evaluating this setting if you see connection issues.
>>>
>>> If you notice that some data is missing on subscriber then could you
>>> increase max_slot_wal_keep_size on publisher so that WALs are not deleted
>>> until they are applied on subscriber.
>>>
>>> Do you have flexibility to increase max_worker_processes and
>>> max_logical_replication_workers, work_mem and maintenance_work_mem on
>>> subscriber (In case bottleneck exists on subscriber)
>>>
>>> If there's significant lag, consider whether it might be more efficient
>>> to drop the subscription and re-initialize it from scratch using a new base
>>> backup, depending on the data volume and how long it might take for the
>>> existing replication to catch up.
>>>
>>
>> Thanks for the kind hints, I'll certainly look into those.
>>
>> My main interest however was with the "visibility" question, i.e. to get
>> an understanding of the gap between the two ends of a replication slot,
>> ideally in human terms (e.g. tables x records).
>>
>> I understand the difficulties of trying to produce a meaningful metric
>> that spans two (or more) systems but let's be honest, trying to diagnose
>> which knobs to tweak (whether in application, PG, the OS or the network) is
>> basically black magic when all we really have is a pair of opaque LSNs.
>>
>>
>>
>>
>>>
>>>  Regards,
>>> Muhammad Ikram
>>>
>>>
>>> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedha...@gmail.com>
>>> wrote:
>>>
>>>> Since nobody more knowledgeable has replied...
>>>>
>>>> I'm very interested in this area and still surprised that there is no
>>>> official/convenient/standard way to approach this (see
>>>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>>>> ).
>>>>
>>>> Based partly on that thread, I ended up with a script that connects to
>>>> both ends of the replication, and basically loops while comparing the
>>>> counts in each table.
>>>>
>>>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiew...@ghx.com>
>>>> wrote:
>>>>
>>>>> I've got two Postgres 13 databases on AWS RDS.
>>>>>
>>>>>    - One is a master, the other a slave using logical replication.
>>>>>    - Replication has fallen behind by about 350Gb.
>>>>>    - The slave was maxed out in terms of CPU for the past four days
>>>>>    because of some jobs that were ongoing so I'm not sure what logical
>>>>>    replication was able to replicate during that time.
>>>>>    - I killed those jobs and now CPU on the master and slave are both
>>>>>    low.
>>>>>    - I look at the subscriber via `select * from
>>>>>    pg_stat_subscription;` and see that latest_end_lsn is advancing albeit 
>>>>> very
>>>>>    slowly.
>>>>>    - The publisher says write/flush/replay lags are all 13 minutes
>>>>>    behind but it's been like that for most of the day.
>>>>>    - I see no errors in the logs on either the publisher or
>>>>>    subscriber outside of some simple SQL errors that users have been 
>>>>> making.
>>>>>    - CloudWatch reports low CPU utilization, low I/O, and low
>>>>>    network.
>>>>>
>>>>>
>>>>>
>>>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>>>> timeout to 0 because I had replication issues, and that helped things. I
>>>>> wish I had *some* visibility here to get any kind of confidence that
>>>>> it's going to pull through, but other than these lsn values and database
>>>>> logs, I'm not sure what to check.
>>>>>
>>>>>
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> mj
>>>>>
>>>>
>>>
>>> --
>>> Muhammad Ikram
>>>
>>>
>
> --
> Muhammad Ikram
>
>

Reply via email to