On Tue, Feb 7, 2023 at 6:38 PM sunyuc...@gmail.com <sunyuc...@gmail.com> wrote:
> Hi there, > > I am using PG 14.14 on both primary and secondary DB on AWS, setup > using a logical replication, I'm having trouble with huge replication > lag. > > My setup is as follows: > > P1 - physical - P1-R > | (logical) > P2 - physical - P2-R > > > The lag between P1 & P1-R , P2 & P2-R are both minimal , less than > 10seconds , but I'm seeing several hours of lag between P1 & P2 .We > currently have 1 pub/sub that covers about 100 tables. > > Here is the output from P1 showing there is a Lag of at least 80GB > (keep growing) > > > SELECT now() AS CURRENT_TIME, > slot_name, > active, > active_pid,confirmed_flush_lsn, > pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), > confirmed_flush_lsn)) AS diff_size, > pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS > diff_bytes > FROM pg_replication_slots > WHERE slot_type = 'logical'; > -[ RECORD 1 ]-------+--------------------------- > current_time | 2023-02-07 23:26:00.733+00 > slot_name | upgrade > active | t > active_pid | 5180 > confirmed_flush_lsn | 26B09/8C08C610 > diff_size | 81 GB > diff_bytes | 86573472240 > > Here is what pg_stat_replication shows: note that the write_lag is > very high: we previously had to set wal_sender_timeout to 0, otherwise > the logical replication work keep exiting and fail. > > > select * from pg_stat_replication; > -[ RECORD 1 ]----+------------------------------ > pid | xxx > usesysid | xxx > usename | dev > application_name | upgrade_target > client_addr | 10.xxx > client_hostname | > client_port | 27404 > backend_start | 2023-02-07 23:02:39.228572+00 > backend_xmin | 2909692747 > state | catchup > sent_lsn | 26B09/8C08C610 > write_lsn | 26B09/840514C0 > flush_lsn | 26B09/840514C0 > replay_lsn | 26B09/840514C0 > write_lag | 00:07:03.60362 > flush_lag | 00:07:03.60362 > replay_lag | 00:07:03.60362 > sync_priority | 0 > sync_state | async > reply_time | 2023-02-07 23:22:08.245066+00 > > So, the problem seems to be like P2 takes a very long time to apply a > chunk, Mow, my question is why? From AWS performance insight, it > looks like there is only 1 process running on P2, that is 100% blocked > by CPU. I don't really understand why it is being blocked by the CPU, > and any help in further debugging is much appreciated. > > Cheers. > Hi Sir, What is the replica identity being used on the tables? Are any of the tables using REPLICA IDENTITY FULL ? How many tables are being replicated?