On Mon, Apr 15, 2024 at 2:53 AM Nicolas Seinlet <nico...@seinlet.com> wrote:
> Hello everyone,
>
> Since I moved some clusters from PostgreSQL 12 to 14, I noticed random
> failures in streaming replication. I say "random" mostly because I haven't
> found the source of the issue.
>
> I'm using the Ubuntu/encrypted ZFS/PostgreSQL combination. I'm using Ubuntu
> LTS (20.04 and 22.04) and the ZFS/PostgreSQL provided with each LTS
> (PostgreSQL 12 on Ubuntu 20.04 and 14 on 22.04).
>
> The streaming replication of PostgreSQL is configured with
> `primary_conninfo 'host=main_server port=5432 user=replicant
> password=a_very_secure_password sslmode=require
> application_name=replication_postgresql_app'`, no replication slot nor
> restore command, and the WAL is configured with `full_page_writes = off
> wal_init_zero = off wal_recycle = off`
>
> While this works like a charm on PostgreSQL 12, it sometimes fails with
> PostgreSQL 14. As we also changed the OS, maybe the issue lies somewhere
> else.
>
> When the issue is detected, the WAL on the primary is correct. A piece of
> the WAL is wrong on the secondary. Only some bytes. Some bytes later, the
> WAL is again correct. Stopping PostgreSQL on the secondary, removing the
> wrong WAL file, and restarting PostgreSQL solves the issue.
>
> We've added another secondary and noticed the issue can appear on one of
> the secondaries, not both at the same time.
>
> What can I do to detect the origin of this issue?
>

1. Minor version number?
2. Using replication_slots?
3. Error message(s)?
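
It may also help to pin down exactly which bytes differ. One option is to copy the bad segment from the secondary and the same-named segment from the primary onto one machine and compare them offline. Below is a minimal Python sketch of that comparison. It is not a PostgreSQL tool: the names `diff_ranges`, `pages_touched` and `compare_files` are made up for illustration, and the 8192-byte page size assumes the default WAL block size.

```python
from pathlib import Path

WAL_PAGE_SIZE = 8192  # assumption: default WAL block size (XLOG_BLCKSZ)


def diff_ranges(primary: bytes, standby: bytes):
    """Return half-open (start, end) byte ranges where the two buffers differ."""
    ranges = []
    start = None
    common = min(len(primary), len(standby))
    for i in range(common):
        if primary[i] != standby[i]:
            if start is None:
                start = i  # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(primary) != len(standby):
        # Any length mismatch is reported as one trailing range.
        ranges.append((common, max(len(primary), len(standby))))
    return ranges


def pages_touched(ranges, page_size=WAL_PAGE_SIZE):
    """Return the sorted page indices that contain at least one differing byte."""
    pages = set()
    for start, end in ranges:
        pages.update(range(start // page_size, (end - 1) // page_size + 1))
    return sorted(pages)


def compare_files(primary_path, standby_path):
    """Read two copies of the same WAL segment and return their differing ranges."""
    return diff_ranges(Path(primary_path).read_bytes(),
                       Path(standby_path).read_bytes())
```

Calling `compare_files()` on the two copies returns the differing byte ranges, and `pages_touched()` on that result shows whether the damage is confined to a single page or straddles a page boundary. Whether the ranges fall on page or filesystem-record boundaries could hint at whether the problem is in the storage layer or in the stream itself.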