On Thu, Dec 28, 2023 at 4:54 PM Kaushik Iska <kaus...@peerdb.io> wrote:
> Hi all,
>
> I'm including additional details, as I am able to reproduce this issue a
> little more reliably.
>
> Postgres Version: POSTGRES_14_9.R20230830.01_07
> Vendor: Google Cloud SQL
> Logical Replication Protocol version 1
>
> Here are the logs of attempt succeeding right after it fails:
>
> 2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
> STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2') <- FAILS
> 2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
> ERROR: requested WAL segment 000000010000000600000059 has already been
> removed
> 2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
> STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2') <- SUCCEEDS
> 2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres LOG:
> logical decoding found consistent point at 6/5A31F050
>
> Happy to include any additional details of my setup.
>
> Thanks,
> Kaushik
>
> On Tue, Dec 26, 2023 at 10:36 AM Kaushik Iska <kaus...@peerdb.io> wrote:
>
>> Dear PostgreSQL Community,
>>
>> I am seeking guidance regarding a recurring issue we've encountered with
>> WAL segment removal during logical replication using pgoutput plugin. We
>> sporadically encounter an error indicating that a requested WAL segment has
>> already been removed. This issue arises intermittently when executing
>> START_REPLICATION. An example error message is as follows:
>>
>>     requested WAL segment 000000010000146000000AE has already been removed
>>
>> Please note that this error is not specific to the segment mentioned
>> above; it serves as an example of the type of error we are experiencing.
>>
>> Additional Context:
>>
>> - max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
>> - The error seems to appear randomly and is not consistent.
>> - After a couple of retries, the replication process eventually
>>   succeeds.
>> - For one of the users it seems to be happening every 16 hours or so.
>>
>> Our approach involves starting with START_REPLICATION 0, replicating data
>> in batches, and then restarting at the last LSN of the previous batch. We
>> are trying to understand the root cause behind the intermittent removal of
>> WAL segments during logical replication. Specifically, we are looking for
>> insights into:
>>
>> - The potential reasons for the WAL segments being reported as removed.
>> - Why this error occurs intermittently and why replication succeeds
>>   after several retries.
>> - Any advice on troubleshooting and resolving this issue, or insights
>>   into whether it might be related to our specific replication setup or a
>>   characteristic of pgoutput, would be highly valuable.
>>
>> Related Posts
>>
>> - https://issues.redhat.com/browse/DBZ-590
>> - Troubleshooting Postgres Sources | Airbyte Documentation
>>   <https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#under-cdc-incremental-mode-there-are-still-full-refresh-syncs>
>> - https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error
>>
>> Thank you very much for your time and assistance.
>>
>> Thanks,
>>
>> Kaushik Iska

It might be interesting to see the contents of pg_replication_slots.
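
To make that concrete, here is a rough sketch of what one could compare once the view has been captured. It maps an LSN to the WAL segment file that contains it, so the slot's restart_lsn can be checked against the segment named in the error. Assumptions on my part: the default 16 MB wal_segment_size, timeline 1 (as in the file names in the logs above), and the helper names and the SLOT_QUERY constant are made up for illustration. The query itself only reads columns that exist in pg_replication_slots on PostgreSQL 14.

```python
# Sketch only: assumes the default 16 MB wal_segment_size and timeline 1.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

# Hypothetical constant: a query to run with any client right after a failure.
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       wal_status, safe_wal_size, pg_walfile_name(restart_lsn) AS restart_segment
FROM pg_replication_slots;
"""


def parse_lsn(text):
    """Convert an LSN such as '6/5AE67D79' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_segment_name(lsn, timeline=1, seg_size=WAL_SEGMENT_SIZE):
    """Return the 24-hex-digit WAL file name that contains the given LSN."""
    segs_per_xlogid = 0x100000000 // seg_size
    segno = lsn // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def segment_start_lsn(name, seg_size=WAL_SEGMENT_SIZE):
    """Return the first LSN covered by a 24-hex-digit WAL file name."""
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    segs_per_xlogid = 0x100000000 // seg_size
    start = (log * segs_per_xlogid + seg) * seg_size
    return "%X/%X" % (start >> 32, start & 0xFFFFFFFF)


if __name__ == "__main__":
    # Values taken from the logs quoted above.
    print(wal_segment_name(parse_lsn("6/5AE67D79")))      # requested start LSN
    print(wal_segment_name(parse_lsn("6/5A31F050")))      # consistent point
    print(segment_start_lsn("000000010000000600000059"))  # segment in the error
```

Under those assumptions, both the requested start LSN and the reported consistent point fall in segment 00000001000000060000005A, while the error names the segment just before it (starting at 6/59000000). Logical decoding begins reading at the slot's restart_lsn rather than at the requested LSN, so it would be useful to know whether restart_lsn pointed into segment ...59 at the time of the failure. That is only a guess until the view output is available.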