Re: Logical replication from 11.x to 12.x and "unique key violations"
I have had this problem with logical replication on PG 10 repeatedly. In a clean build, no problem. But if I am restarting replication because of some problem, I've seen errors from rows that are already present.

My own fix, which has worked in my shop, is to add replica triggers that check whether the row is already present. If it is, they drop the incoming row on the floor. This lets data come in in whatever order it happens to arrive. Sample code:

    if TG_OP = 'INSERT' then
        select id into id1 from table1 where id = new.id;
        if id1 is not null then
            -- log error for analysis
            return null;
        end if;
    end if;

In an ideal world this would probably not be necessary. But it can be tricky to restart replication in an absolutely clean way across all tables, and in that case it can be better to allow for a bit of overlap in the rows being replicated.

FWIW,

John

> On Jul 20, 2020, at 1:47 PM, Tom Lane wrote:
>
> Thomas Kellerer writes:
>>> I have a strange error when using logical replication between an 11.2
>>> source database and a 12.3 target.
>>>
>>> If I create the publication with all needed tables (about 50) at
>>> once, I get "duplicate key value violates unique constraint xxx_pkey"
>>> errors during the initial replication (when creating the
>>> subscription).
>>>
>>> When I create the publication with only a few tables, the initial data
>>> sync works without problems. To replicate all tables, I add the
>>> tables incrementally to the publication and refresh the
>>> subscription.
>>>
>>> If I do it like that (step by step), everything works fine. Tables
>>> that generated the "duplicate key value" error previously will
>>> replicate just fine. The tables are quite small, some of them less
>>> than 100 rows.
>
> I have not looked at the code, but it wouldn't surprise me if the initial
> replication just copies all the specified tables in some random order.
> If there are FK references involved, the replication would have to be
> done with referenced tables first, and I bet there's no logic for that.
> (Even if there was, it could not cope with circular references or
> self-references.)
>
> Best bet might be to not install the subscriber's foreign key
> constraints till after the initial sync is done.
>
> regards, tom lane
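For anyone who wants to adapt John's duplicate-dropping approach, a fuller version of the trigger might look like the sketch below. It is only a sketch: table1, its id column, and the replication_conflict_log table are placeholder names, not objects from the post above.

    -- Hypothetical example: drop incoming rows that already exist, logging them for analysis.
    create table if not exists replication_conflict_log (
        logged_at  timestamptz not null default now(),
        table_name text,
        row_id     bigint
    );

    create or replace function skip_duplicate_insert() returns trigger as $$
    declare
        id1 bigint;
    begin
        if TG_OP = 'INSERT' then
            select id into id1 from table1 where id = new.id;
            if id1 is not null then
                -- Row already present: log it and drop the incoming copy on the floor.
                insert into replication_conflict_log (table_name, row_id)
                values (TG_TABLE_NAME, new.id);
                return null;
            end if;
        end if;
        return new;
    end;
    $$ language plpgsql;

    create trigger table1_skip_dup
        before insert on table1
        for each row execute procedure skip_duplicate_insert();

    -- ENABLE REPLICA makes the trigger fire for rows applied by the logical
    -- replication worker, which runs with session_replication_role = replica.
    alter table table1 enable replica trigger table1_skip_dup;

Because the trigger is enabled as a replica trigger, it stays out of the way of normal local inserts and only fires during replication apply.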
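Tom's suggestion of holding back the subscriber's foreign keys until the initial sync is done could be scripted roughly as below. The table, constraint, subscription, and publication names are made up for illustration; NOT VALID / VALIDATE CONSTRAINT is used only so the re-check of existing rows happens after the fact, with a weaker lock.

    -- On the subscriber, before creating the subscription: drop the FK.
    alter table orders drop constraint orders_customer_id_fkey;

    create subscription my_sub
        connection 'host=... dbname=... user=...'
        publication my_pub;

    -- Once the initial sync has finished (pg_subscription_rel shows srsubstate = 'r'
    -- for the tables), put the FK back.  NOT VALID skips checking rows already there;
    -- VALIDATE CONSTRAINT checks them afterwards.
    alter table orders
        add constraint orders_customer_id_fkey
        foreign key (customer_id) references customers (id) not valid;

    alter table orders validate constraint orders_customer_id_fkey;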
How can you find out what point logical replication is at?
I have logical replication set up from a factory in Zhuhai, China, to a data warehouse in New Jersey. We are using PostgreSQL 10.13 on both sides, on Red Hat Linux 7.6.

The logical replication has been in "catchup" mode for several days now, stuck at a specific LSN (9EF/89ADF7E0). The slave side seems to be chugging along, generating lots of WAL files, but not actually getting any new records in.

The logical replication is being used to move some fairly large files: averaging about 1 MB, but with a maximum of up to about 250 MB. (I think I want to fix that longer term, but that's not a quick fix.)

My working hypothesis is that we are stuck on a long transaction: that we can't get some abnormally large blob over before we drop the line. In that case fixing the connection should fix the problem.

Is there a way I can see what is going on? The slave is working hard, but what is it working on?

And are there any suggestions on how to handle this? I could restart the logical replication with "copy_data = false", then fill in the holes by hand. But I would rather not!

Thanks in advance!

John

PS. Output of pg_stat_replication and pg_stat_subscription on master and slave respectively. (Some proprietary information X'd out.)

select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 42451
usesysid         | 10
usename          | postgres
application_name | china_to_nj_sub
client_addr      | XXX.XXX.XXX.XXX
client_hostname  |
client_port      | 54300
backend_start    | 2020-08-03 09:07:07.257454-04
backend_xmin     | 16574498
state            | catchup
sent_lsn         | 9EF/89ADF7E0
write_lsn        | 9EF/89ADF7E0
flush_lsn        | 9EF/89ADF7E0
replay_lsn       | 9EF/89ADF7E0
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async

select * from pg_stat_subscription;
-[ RECORD 1 ]---------+------------------------------
subid                 | 7222699
subname               | china_to_nj_sub
pid                   | 14764
relid                 |
received_lsn          | 9EF/89ADF7E0
last_msg_send_time    | 2020-08-03 10:15:48.644575-04
last_msg_receipt_time | 2020-08-03 10:14:57.247993-04
latest_end_lsn        | 9EF/89ADF7E0
latest_end_time       | 2020-08-03 09:30:57.974223-04
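A few starting points for seeing what the catchup is actually working on. These are all stock views and functions in PostgreSQL 10; the only assumption below is that the replication slot is named after the subscription (china_to_nj_sub), which is the default.

    -- On the publisher: how far the WAL sender is behind current WAL, and where
    -- the slot would restart decoding from after a disconnect.
    select application_name, state, sent_lsn,
           pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag_bytes
    from pg_stat_replication;

    select slot_name, active, restart_lsn, confirmed_flush_lsn
    from pg_replication_slots
    where slot_name = 'china_to_nj_sub';

    -- On the publisher: any long-open transaction that logical decoding may be
    -- working its way through.
    select pid, xact_start, now() - xact_start as xact_age, state, left(query, 60) as query
    from pg_stat_activity
    where xact_start is not null
    order by xact_start
    limit 5;

    -- On the subscriber: what the apply worker has actually committed locally.
    select * from pg_replication_origin_status;
    select * from pg_stat_subscription;

Comparing sent_lsn with pg_current_wal_lsn() on the publisher at least shows how much WAL is still waiting to be decoded and shipped, even when the subscriber-side numbers are not moving.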
Re: How can you find out what point logical replication is at? -- or weird, slow, infinite loop
This has gotten a bit weirder.

The replication gets up to a specific LSN. Then it moves sent_lsn back to a lower value, for a WAL file from about an hour or more earlier, and slowly walks forward again until it hits the same high-water mark. So the replication seems to be stuck in some kind of slow infinite loop.

Corrupted WAL file? Any other ideas of what to look for?

TIA,

John

> On Aug 3, 2020, at 10:38 AM, John Ashmead wrote:
>
> I have logical replication set up from a factory in Zhuhai, China, to a data
> warehouse in New Jersey. We are using PostgreSQL 10.13 on both sides, on
> Red Hat Linux 7.6.
>
> The logical replication has been in "catchup" mode for several days now,
> stuck at a specific LSN (9EF/89ADF7E0). The slave side seems to be chugging
> along, generating lots of WAL files, but not actually getting any new
> records in.
>
> The logical replication is being used to move some fairly large files:
> averaging about 1 MB, but with a maximum of up to about 250 MB. (I think I
> want to fix that longer term, but that's not a quick fix.)
>
> My working hypothesis is that we are stuck on a long transaction: that we
> can't get some abnormally large blob over before we drop the line. In that
> case fixing the connection should fix the problem.
>
> Is there a way I can see what is going on? The slave is working hard, but
> what is it working on?
>
> And are there any suggestions on how to handle this?
>
> I could restart the logical replication with "copy_data = false", then fill
> in the holes by hand. But I would rather not!
>
> Thanks in advance!
>
> John
>
> PS. Output of pg_stat_replication and pg_stat_subscription on master and
> slave respectively. (Some proprietary information X'd out.)
>
> select * from pg_stat_replication;
> -[ RECORD 1 ]----+------------------------------
> pid              | 42451
> usesysid         | 10
> usename          | postgres
> application_name | china_to_nj_sub
> client_addr      | XXX.XXX.XXX.XXX
> client_hostname  |
> client_port      | 54300
> backend_start    | 2020-08-03 09:07:07.257454-04
> backend_xmin     | 16574498
> state            | catchup
> sent_lsn         | 9EF/89ADF7E0
> write_lsn        | 9EF/89ADF7E0
> flush_lsn        | 9EF/89ADF7E0
> replay_lsn       | 9EF/89ADF7E0
> write_lag        |
> flush_lag        |
> replay_lag       |
> sync_priority    | 0
> sync_state       | async
>
> select * from pg_stat_subscription;
> -[ RECORD 1 ]---------+------------------------------
> subid                 | 7222699
> subname               | china_to_nj_sub
> pid                   | 14764
> relid                 |
> received_lsn          | 9EF/89ADF7E0
> last_msg_send_time    | 2020-08-03 10:15:48.644575-04
> last_msg_receipt_time | 2020-08-03 10:14:57.247993-04
> latest_end_lsn        | 9EF/89ADF7E0
> latest_end_time       | 2020-08-03 09:30:57.974223-04
>
> John
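One hedged guess at the loop, rather than WAL corruption: if a single very large transaction takes longer to stream over the slow link than wal_sender_timeout / wal_receiver_timeout allow, the connection gets dropped, the apply worker reconnects, and decoding resumes from the slot's restart_lsn. From the outside that looks exactly like sent_lsn jumping back to an earlier WAL position and crawling forward to the same high-water mark again. A quick way to check the positions and timeouts involved (standard views and settings; nothing specific to this setup is assumed beyond the output quoted above):

    -- On the publisher: the point decoding restarts from after a disconnect.
    select slot_name, active, restart_lsn, confirmed_flush_lsn
    from pg_replication_slots;

    show wal_sender_timeout;

    -- On the subscriber:
    show wal_receiver_timeout;
    show wal_receiver_status_interval;

If that is what is happening, the publisher's log will usually contain "terminating walsender process due to replication timeout" at each iteration, and raising the timeouts (or fixing the link) would let the big transaction finish once instead of being re-sent forever.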