Hi, here's an update on this old thread. I've found the problem: it was caused by long transactions. I reproduced it when I added a second publication and another postgres cluster with a subscription to it (still on acceptance).
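For reference, the reproduction setup was roughly the following (a minimal sketch; the publication, subscription and connection names below are made up, not the real ones):

-- on the publisher
create publication pub_acceptance for table some_schema.some_table;

-- on the extra acceptance cluster; the initial table sync starts right after this
create subscription sub_acceptance
  connection 'host=publisher.example dbname=mydb user=repl_user'
  publication pub_acceptance;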
This query came in handy to show me what was happening:

select r.pid, a.state as query_state, a.wait_event, r.application_name, r.state as replication_state,
  -- now()-a.query_start as query_time,
  r.write_lag, pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.write_lsn)) as write_lag2,
  r.flush_lag, pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.flush_lsn)) as flush_lag2,
  r.replay_lag, pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn)) as replay_lag2,
  r.client_addr, a.backend_start, a.state_change, a.query_start, now()
from pg_stat_replication r
left join pg_stat_activity a on a.pid = r.pid
order by r.application_name;

What I saw was that replication would "start", then enter a "catchup" state and detect about 5GB of data, which it would then process, so I would see the lag numbers decline as I repeated that query. When the number hit 0 or a slightly negative value, it would take a minute (literally) and then the whole process was repeated. BTW, I was seeing negative numbers at some point, so I swapped the arguments (e.g. r.write_lsn, r.sent_lsn) and changed them back later. So I think that the 5GB was actually -5GB, strangely. During the minute-long wait at near-zero lag, the wait event would be "ReorderBufferWrite".

In the log there was an important extra line that I hadn't noticed before, because it doesn't occur as often as all the lines about workers exiting with an error:

2021-07-14 14:04:58.110 CEST [22782] ERROR: terminating logical replication worker due to timeout
2021-07-14 14:04:58.112 CEST [3720] LOG: background worker "logical replication worker" (PID 22782) exited with exit code 1

I was already suspecting long transactions, so I thought that this timeout might be part of the problem. At that point I changed the wal_receiver_timeout option to 0 in the postgresql.conf file and the behaviour described above ceased. Instead I saw the lag values increase rapidly, by about 100GB per hour (and this is not a high-traffic database).

There were long transactions on the subscriber database (see the PS below), but merely killing those processes wasn't enough, because they would soon be started again. I had to disable the job that started them for about 15 minutes; that did the trick. After that, those long transactions were not a problem anymore.

A guess at what is causing this: the long transactions on the subscriber node are only a problem during the initial snapshot phase, which runs in a transaction and needs heavy locks on the tables that are to be replicated. The initial snapshot was cancelled because it didn't finish within the wal_receiver_timeout, so it was restarted later. When I disabled the timeout, all the WAL on the server was retained. There were still periodic retries, but apparently there was no pending request for a lock, or it would have gotten one the moment I killed that long transaction. Every time the snapshot process was retried, there was already a new long transaction preventing the lock for the snapshot. So I had to prevent long transactions for long enough that the initial snapshot could get its lock on the subscriber. However, I don't know which timeout defines how long a transaction can run without disturbing the snapshot.

Please correct me where I'm wrong.

Cheers,
Willy Loos
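PS: For completeness, something like this shows the long-running transactions on the subscriber (just a rough sketch, not the exact query I ran; the 5 minute threshold is arbitrary, since I don't know the exact limit that disturbs the snapshot):

select pid, usename, state, xact_start, now() - xact_start as xact_age, left(query, 60) as query
from pg_stat_activity
where xact_start is not null
  and now() - xact_start > interval '5 minutes'
order by xact_start;

Killing them would then be select pg_terminate_backend(pid) for those pids, but as described above that wasn't enough on its own, because the job started new ones right away.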
On Fri, Jun 11, 2021 at 9:51 AM Willy-Bas Loos <willy...@gmail.com> wrote:

> Hi, I was going to follow up on this one, sorry for the long silence.
> The replication is working fine now, and I have no idea what the problem
> was. Not cool.
> If I find out, I will let you know.
>
> On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek <t...@sourcepole.ch> wrote:
>
>> Hi Willy-Bas Loos,
>>
>> On 31.05.21 17:32, Willy-Bas Loos wrote:
>> >
>> > On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
>> > <vijaykumarjain.git...@gmail.com
>> > <mailto:vijaykumarjain.git...@gmail.com>> wrote:
>> >
>> > So I got it all wrong it seems :)
>> >
>> > Thank you for taking the time to help me!
>> >
>> > You upgraded to pg13 fine? , but while on pg13 you have issues with
>> > logical replication ?
>> >
>> > Yes, the upgrade went fine. So here are some details:
>> > I already had londiste running on postgres 9.3, but londiste wouldn't
>> > run on Debian 10
>> > So i first made the new server Debian 9 with postgres 9.6 and i started
>> > replicating with londiste from 9.3 to 9.6
>> > When all was ready, i stopped the replication to the 9.6 server and
>> > deleted all londiste & pgq content with drop schema cascade.
>> > Then I upgraded the server to Debian 10. Then i used pg_upgrade to
>> > upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
>> > Then I added logical replication and a third server as a subscriber.
>> >
>> > I was going to write that replication is working fine (since the table
>> > contains a lot of data and there are no conflicts in the log), but it
>> > turns out that it isn't.
>> > The subscriber is behind and it looks like there hasn't been any
>> > incoming data after the initial data synchronization.
>> > So at least now i know that the WAL is being retained with a reason. The
>> > connection is working properly (via psql anyway)
>>
>> I once maybe had a similar problem due to some ports that were needed
>> for replication being firewalled off or respectively the master having
>> the wrong IP address of the old master (now standby server) or such.
>>
>> There was absolutely no word anywhere in any log about the problem, I was
>> just seeing the new postgres master not starting up after hours and
>> hours of waiting after a failover. I somehow found out about the
>> required port being blocked (I don't remember - maybe seeing the
>> unanswered SYNs in tcpdump? Or via ufw log entries?).
>>
>> > I will also look into how to diagnose this from the system tables, e.g.
>> > subtracting LSNs to get some quantitative measure for the lag.
>> >
>> > There is a path in the postgresql source user subscription folder
>> > iirc which covers various logical replication scenarios.
>> > That may help you just in case.
>> >
>> > OK, so comments in the source code you mean?
>>
>
> --
> Willy-Bas Loos

--
Willy-Bas Loos