So it seems I got it all wrong :) You upgraded to pg13 fine, but while on pg13 you have issues with logical replication?
There is a path in the postgresql source, under the subscription folder iirc, which covers various logical replication scenarios. That may help you, just in case. I have tried all known ways in which replication breaks like above, and once I resolve conflicts it starts fine. I'll try to explore more scenarios.
Pardon my link to pglogical. I misunderstood.

On Mon, 31 May 2021 at 7:25 PM Willy-Bas Loos <willy...@gmail.com> wrote:

> Thank you for elaborating those possible causes and for the suggestions you made.
>
> 1) if you have an inactive replication slot.
> There is only 1 replication slot and it is active. So that is not the issue.
>
> 2) Do you have archiving enabled?
> No, I never turned it on, and so this is in the settings of both publisher and subscriber: #archive_mode = off (and show archive_mode; tells me the same).
>
> 3) logical replication can be broken for multiple reasons, like conflicts where the subscriber already has the data which primary wants to push. it will not proceed until the conflicts are resolved.
> That would have been in the log, but there isn't any conflict in the log. Only the messages that I posted with my first message.
>
> 4) poor connectivity or the computer/network resources not able to keep up with the load, can result in WAL pile up.
> This would be strange since there is a 10 Gbps connection within the same rack. But it could theoretically be malfunctioning, or the performance on the subscriber could be too low.
> If any of this is the case, shouldn't that be visible in pg_stat_subscription?
>
> Thanks for the article, it's interesting to see how they transitioned from Londiste, even if the article is about pglogical, not logical replication in the postgres core.
> I was using Londiste to transfer the data to the new server and minimize downtime, so the article might come in handy.
> I prepared by reading the documentation, which is very straightforward.
>
> >btw,
> >how are you doing logical replication with 9.3 ? using a pglogical extension ?
> No, I'm not using logical replication in postgres 9.3, only on postgres 13.
>
> About the link to the bug reports: thanks for the suggestion. But first I'd like to get a better grip on what is going on before searching for bugs.
>
> Still, any help will be much appreciated.
>
> On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <vijaykumarjain.git...@gmail.com> wrote:
>
>> WAL can be built up for reasons like
>> 1) if you have an inactive replication slot. I mean you had a streaming replica which was turned off, but you did not remove the slot from the primary.
>> 2) Do you have archiving enabled? Are the archiving commands running fine? If just the archiving is broken, then you can manually run archive cleanup, provided replication is all caught up fine.
>> 3) logical replication can be broken for multiple reasons, like conflicts where the subscriber already has the data which primary wants to push. It will not proceed until the conflicts are resolved.
>> 4) poor connectivity or the computer/network resources not able to keep up with the load can result in WAL pile up.
>>
>> There are many blogs around logical replication issues, but when it was new in pg10, I read this:
>> Recovery use cases for Logical Replication in PostgreSQL 10 | by Konstantin Evteev | AvitoTech | Medium
>> <https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>
>>
>> btw,
>> how are you doing logical replication with 9.3 ? using a pglogical extension ?
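Referring back to points 1) and 4) and to the pg_stat_subscription question above, a quick sketch of what could be checked on the publisher to see how much WAL each slot is pinning and how far the walsender has gotten (generic PostgreSQL 13 catalog queries, nothing specific to this setup):

-- how much WAL is each replication slot holding back?
SELECT slot_name, active, wal_status, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- progress of the walsender feeding the subscription
SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

A slot with active = f, or with a large retained_wal, would point at cause 1) or 4) respectively. pg_stat_subscription on the subscriber reports the apply worker's received LSN and timestamps, but the amount of WAL being retained is only visible on the publisher side.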
>> I can try many things, but it would be wrong to make assumptions since I did not work with 9.3.
>> for ex.
>> Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull Request #295 · 2ndQuadrant/pglogical (github.com)
>> <https://github.com/2ndQuadrant/pglogical/pull/295>
>> there are many issues posted here that may be relevant to your setup.
>>
>> On Sat, 29 May 2021 at 19:22, Willy-Bas Loos <willy...@gmail.com> wrote:
>>
>>> Yeah, indexes could slow things down, thanks. Btw I'm not using logical replication for the upgrade, that's not supported for 9.3.
>>> It was more complicated but that's beside the point.
>>>
>>> I could just delete the publication and all that belongs to it and start over. But since I'm trying out logical replication, I would like to be more in control than that. Is there anything that I can dig into to find out why the WAL is accumulating?
>>>
>>> On Fri, 28 May 2021 at 22:20, Vijaykumar Jain <vijaykumarjain.git...@gmail.com> wrote:
>>>
>>>> I am not too sure with 9.3.
>>>> I tried an upgrade from 9.6 to 11 using logical replication (pglogical extension).
>>>>
>>>> one thing to note:
>>>> logical replication initiates a copy from a snapshot, then changes from then on.
>>>>
>>>> I had a very high insert rate on my source tables (v9.6) and the destination (v11) could not keep up (it had tons of indexes when I copied the schema), and it took around a day as the table had around 12 indexes.
>>>>
>>>> So at the destination (v11), I dropped all but the primary index for each table, started the subscription and, when it was almost caught up, rebuilt the indexes on the destination concurrently.
>>>> It completed in 4-5 hours without stopping the source. Migration completed in a few mins :)
>>>>
>>>> not sure if this would help, but just FYI.
>>>>
>>>> On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willy...@gmail.com> wrote:
>>>>
>>>>> Hi, I'm upgrading a 1.5 TB database from postgres 9.3 to postgres 13 on Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal and my disks are getting full. The oldest WAL file is 18 days old.
>>>>> I use Logical Replication from the new cluster to another new cluster, with 1 subscriber and 1 subscription.
>>>>>
>>>>> pg_stat_subscription tells me all recent timestamps.
>>>>> and this:
>>>>> db=# select * from pg_replication_slots;
>>>>> -[ RECORD 1 ]-------+-------------
>>>>> slot_name           | my_pub1
>>>>> plugin              | pgoutput
>>>>> slot_type           | logical
>>>>> datoid              | 16401
>>>>> database            | db
>>>>> temporary           | f
>>>>> active              | t
>>>>> active_pid          | 9480
>>>>> xmin                |
>>>>> catalog_xmin        | 269168
>>>>> restart_lsn         | D4/908BC268
>>>>> confirmed_flush_lsn | E1/25BF5710
>>>>> wal_status          | extended
>>>>> safe_wal_size       |
>>>>>
>>>>> I've had problems with diskspace on this server, with postgres crashing because of it, then added more diskspace and postgres recovered. This doesn't seem to be a problem now.
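As an aside on the pg_replication_slots output above: the distance between the two LSNs it reports can be computed directly. A small sketch, using the literal values quoted above:

SELECT pg_size_pretty(
         pg_wal_lsn_diff('E1/25BF5710'::pg_lsn, 'D4/908BC268'::pg_lsn)
       ) AS confirmed_flush_ahead_of_restart;

That difference is roughly 50 GB, i.e. the subscriber has confirmed data well past the point the slot must be able to restart decoding from. Comparing restart_lsn against pg_current_wal_lsn() (as in the earlier sketch) would show how much of the 1.4 TB in pg_wal the slot is actually still pinning.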
>>>>>
>>>>> The *publication* has the options publish = 'insert, update, delete, truncate', publish_via_partition_root = false
>>>>> The *subscription* has the options connect = true, enabled = true, create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'
>>>>>
>>>>> The log on the publisher says:
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical decoding for slot "my_pub1"
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding found consistent point at D4/908BC268
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no running transactions.
>>>>> 2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot "my_pub1" is active for PID 4584
>>>>> 2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot "my_pub1" is active for PID 4584
>>>>>
>>>>> And on the subscriber:
>>>>> 2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply worker for subscription "my_pub1" has started
>>>>> 2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL streaming: ERROR: replication slot "my_pub1" is active for PID 730
>>>>> 2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical replication worker" (PID 40039) exited with exit code 1
>>>>>
>>>>> The postgres settings on the *publisher* are:
>>>>> max_connections = 100 # (change requires restart)
>>>>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>>>>> shared_buffers = 50GB # min 128kB
>>>>> work_mem = 1GB # min 64kB
>>>>> maintenance_work_mem = 10GB # min 1MB
>>>>> logical_decoding_work_mem = 5GB # min 64kB
>>>>> dynamic_shared_memory_type = posix # the default is the first option
>>>>> max_worker_processes = 20 # (change requires restart)
>>>>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>>>>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>>>>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>>>>> wal_level = logical # minimal, replica, or logical
>>>>> max_wal_size = 1GB
>>>>> min_wal_size = 80MB
>>>>> #archive_mode = off
>>>>> max_wal_senders = 10 # max number of walsender processes
>>>>> wal_sender_timeout = 1min # in milliseconds; 0 disables
>>>>> max_replication_slots = 7 # max number of replication slots
>>>>>
>>>>> The postgres settings on the *subscriber* are:
>>>>> max_connections = 100 # (change requires restart)
>>>>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>>>>> shared_buffers = 25GB # min 128kB
>>>>> work_mem = 1GB # min 64kB
>>>>> maintenance_work_mem = 10GB # min 1MB
>>>>> logical_decoding_work_mem = 5GB # min 64kB
>>>>> dynamic_shared_memory_type = posix # the default is the first option
>>>>> max_worker_processes = 20 # (change requires restart)
>>>>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>>>>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>>>>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>>>>> wal_level = logical # minimal, replica, or logical
>>>>> max_wal_size = 3GB
>>>>> min_wal_size = 80MB
>>>>> #archive_mode = off
>>>>> wal_receiver_timeout = 1min # time that receiver waits for
>>>>> max_logical_replication_workers = 10 # taken from max_worker_processes
>>>>> max_sync_workers_per_subscription = 5 # taken from max_logical_replication_workers
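On the 'replication slot "my_pub1" is active for PID ...' errors in the logs above, a sketch of how one might check, on the publisher, what the process that owns the slot is doing and whether it is still advancing (4584 is simply the PID taken from the log excerpt):

-- is the walsender that owns the slot still streaming, and where is it?
SELECT pid, application_name, state, sent_lsn, flush_lsn, replay_lsn, reply_time
FROM pg_stat_replication
WHERE pid = 4584;

-- what is that backend waiting on, if anything?
SELECT pid, state, wait_event_type, wait_event, backend_start
FROM pg_stat_activity
WHERE pid = 4584;

If that walsender is alive and advancing, the errors are just a second worker being refused while the slot is busy; if it is stuck, its flush/replay position would also explain why restart_lsn is not moving forward.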
>>>>>
>>>>> I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10 minutes each, but this had no positive effect.
>>>>>
>>>>> Some advice would be helpful
>>>>> --
>>>>> Willy-Bas Loos
>>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Vijay
>>>> Mumbai, India
>>>
>>
>> --
>> Thanks,
>> Vijay
>> Mumbai, India
>
>
> --
> Willy-Bas Loos

--
Thanks,
Vijay
Mumbai, India
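On the wal_sender_timeout / wal_receiver_timeout change mentioned above: for reference, one way such values could be applied without a restart (a sketch, assuming ALTER SYSTEM is acceptable in this environment):

-- on the publisher
ALTER SYSTEM SET wal_sender_timeout = '10min';
SELECT pg_reload_conf();

-- on the subscriber
ALTER SYSTEM SET wal_receiver_timeout = '10min';
SELECT pg_reload_conf();

Both settings take effect on reload, so no restart is needed for this particular change.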