On Tue, Feb 7, 2023 at 6:38 PM sunyuc...@gmail.com <sunyuc...@gmail.com>
wrote:

> Hi there,
>
> I am using PG 14.14 on both primary and secondary DB on AWS, setup
> using a logical replication, I'm having trouble with huge replication
> lag.
>
> My setup is as follows:
>
> P1 - physical - P1-R
>   | (logical)
> P2 - physical - P2-R
>
>
> The lag between P1 & P1-R , P2 & P2-R are both minimal , less than
> 10seconds , but I'm seeing several hours of lag between P1 & P2  .We
> currently have 1 pub/sub that covers about 100 tables.
>
> Here is the output from P1 showing there is a Lag of at least 80GB
> (keep growing)
>
> > SELECT now() AS CURRENT_TIME,
>        slot_name,
>        active,
>        active_pid,confirmed_flush_lsn,
>        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
> confirmed_flush_lsn)) AS diff_size,
>        pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS
> diff_bytes
> FROM pg_replication_slots
> WHERE slot_type = 'logical';
> -[ RECORD 1 ]-------+---------------------------
> current_time        | 2023-02-07 23:26:00.733+00
> slot_name           | upgrade
> active              | t
> active_pid          | 5180
> confirmed_flush_lsn | 26B09/8C08C610
> diff_size           | 81 GB
> diff_bytes          | 86573472240
>
> Here is what pg_stat_replication shows: note that the write_lag is
> very high: we previously had to set wal_sender_timeout to 0, otherwise
> the logical replication work keep exiting and fail.
>
> > select * from pg_stat_replication;
> -[ RECORD 1 ]----+------------------------------
> pid              | xxx
> usesysid         | xxx
> usename          | dev
> application_name | upgrade_target
> client_addr      | 10.xxx
> client_hostname  |
> client_port      | 27404
> backend_start    | 2023-02-07 23:02:39.228572+00
> backend_xmin     | 2909692747
> state            | catchup
> sent_lsn         | 26B09/8C08C610
> write_lsn        | 26B09/840514C0
> flush_lsn        | 26B09/840514C0
> replay_lsn       | 26B09/840514C0
> write_lag        | 00:07:03.60362
> flush_lag        | 00:07:03.60362
> replay_lag       | 00:07:03.60362
> sync_priority    | 0
> sync_state       | async
> reply_time       | 2023-02-07 23:22:08.245066+00
>
> So, the problem seems to be like P2 takes a very long time to apply a
> chunk,  Mow, my question is why?  From AWS performance insight, it
> looks like there is only 1 process running on P2, that is 100% blocked
> by CPU. I don't really understand why it is being blocked by the CPU,
> and any help in further debugging is much appreciated.
>
> Cheers.
>

Hi Sir,

What is the replica identity being used on the tables?  Are any of the
tables using  REPLICA IDENTITY FULL ?

How many tables are being replicated?

Reply via email to