This has gotten a bit weirder.

The replication is getting up to a specific LSN.  

Then it moves the sent_lsn to a number less than that, for a WAL file about an 
hour or more earlier and slowly walks forward again till it hit the same 
highwater mark.

So the replication seems to be stuck in some kind of a slow infinite loop.

Corrupted WAL file?  Any other ideas of what to look for?

TIA,

John

> On Aug 3, 2020, at 10:38 AM, John Ashmead <john.ashm...@ashmeadsoftware.com> 
> wrote:
> 
> I have logical replication setup from a factory in Zhuhai China to a data 
> warehouse in New Jersey. We are using postgresql 10.13 on both sides, on 
> Redhat Linux 7.6.
> 
> The logical replication has been in “catchup” mode for several days now, 
> stuck at a specific LSN (9EF/89ADF7E0). The slave side seems to be chugging 
> along, generating lots of WAL files — but not actually getting any new 
> records in. 
> 
> The logical replication is being used to move some fairly large files:  
> averaging about 1 MB but with a max up to about 250 MB. (I think I want to 
> fix that longer term, but that’s not a quick fix.)
> 
> My working hypothesis is that we are stuck on a long transaction: that we 
> can’t get some abnormally large blob over before we drop the line.  In this 
> case fixing the connection should fix the problem.
> 
> Is there a way I can see what is going on?  The slave is working hard, but 
> what is it working on?
> 
> And are there any suggestions on how to handle this?
> 
> I could restart the logical replication with “copy_data = false”, then fill 
> in the holes by hand.  But I would rather not!
> 
> Thanks in advance!
> 
> John
> 
> PS. Output of pg_stat_replication & pg_stat_subscription on master & slave 
> respectively.  (Some proprietary information X’d out)
> 
> select * from pg_stat_replication;
> -[ RECORD 1 ]----+------------------------------
> pid              | 42451
> usesysid         | 10
> usename          | postgres
> application_name | china_to_nj_sub
> client_addr      | XXX.XXX.XXX.XXX
> client_hostname  | 
> client_port      | 54300
> backend_start    | 2020-08-03 09:07:07.257454-04
> backend_xmin     | 16574498
> state            | catchup
> sent_lsn         | 9EF/89ADF7E0
> write_lsn        | 9EF/89ADF7E0
> flush_lsn        | 9EF/89ADF7E0
> replay_lsn       | 9EF/89ADF7E0
> write_lag        | 
> flush_lag        | 
> replay_lag       | 
> sync_priority    | 0
> sync_state       | async
> 
>  select * from pg_stat_subscription;
> -[ RECORD 1 ]---------+------------------------------
> subid                 | 7222699
> subname               | china_to_nj_sub
> pid                   | 14764
> relid                 | 
> received_lsn          | 9EF/89ADF7E0
> last_msg_send_time    | 2020-08-03 10:15:48.644575-04
> last_msg_receipt_time | 2020-08-03 10:14:57.247993-04
> latest_end_lsn        | 9EF/89ADF7E0
> latest_end_time       | 2020-08-03 09:30:57.974223-04
> 
> 
> John Ashmead
> 139 Montrose Avenue
> Rosemont, PA, 19010-1508
> (610) 527 9560 
> mobile (610) 247 2323
> john.ashm...@ashmeadsoftware.com <mailto:john.ashm...@ashmeadsoftware.com>
> 
> 
> 
> 
> 
> 

John Ashmead
139 Montrose Avenue
Rosemont, PA, 19010-1508
(610) 527 9560 
mobile (610) 247 2323
john.ashm...@ashmeadsoftware.com






Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to