Re: Intermittent Issue with WAL Segment Removal in Logical Replication

Ron Johnson Thu, 28 Dec 2023 14:25:10 -0800

On Thu, Dec 28, 2023 at 4:54 PM Kaushik Iska <kaus...@peerdb.io> wrote:


> Hi all,
>
> I'm including additional details, as I am able to reproduce this issue a
> little more reliably.
>
> Postgres Version: POSTGRES_14_9.R20230830.01_07
> Vendor: Google Cloud SQL
> Logical Replication Protocol version 1
>
> Here are the logs of an attempt succeeding right after one fails:
>
> 2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
> STATEMENT:  START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2') <- FAILS
> 2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
> ERROR:  requested WAL segment 000000010000000600000059 has already been
> removed
> 2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
> STATEMENT:  START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2')  <- SUCCEEDS
> 2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres LOG:
>  logical decoding found consistent point at 6/5A31F050
>
> Happy to include any additional details of my setup.
>
> Thanks,
> Kaushik
>
>
> On Tue, Dec 26, 2023 at 10:36 AM Kaushik Iska <kaus...@peerdb.io> wrote:
>
>> Dear PostgreSQL Community,
>>
>> I am seeking guidance regarding a recurring issue we've encountered with
>> WAL segment removal during logical replication using the pgoutput plugin. We
>> sporadically encounter an error indicating that a requested WAL segment has
>> already been removed. This issue arises intermittently when executing
>> START_REPLICATION. An example error message is as follows:
>>
>>
>> requested WAL segment 000000010000146000000AE has already been removed
>>
>>
>> Please note that this error is not specific to the segment mentioned
>> above; it serves as an example of the type of error we are experiencing.
>>
>> Additional Context:
>>
>>
>>    - max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
>>    - The error seems to appear randomly and is not consistent.
>>    - After a couple of retries, the replication process eventually
>>      succeeds.
>>    - For one of the users it seems to be happening every 16 hours or so.
>>
>>
>> Our approach involves starting with START_REPLICATION 0, replicating data
>> in batches, and then restarting at the last LSN of the previous batch. We
>> are trying to understand the root cause behind the intermittent removal of
>> WAL segments during logical replication. Specifically, we are looking for
>> insights into:
>>
>>
>>    - The potential reasons for the WAL segments being reported as removed.
>>    - Why this error occurs intermittently and why replication succeeds
>>      after several retries.
>>    - Any advice on troubleshooting and resolving this issue, or insights
>>      into whether it might be related to our specific replication setup or a
>>      characteristic of pgoutput, would be highly valuable.
>>
>>
>> Related Posts
>>
>>
>>    - https://issues.redhat.com/browse/DBZ-590
>>    - Troubleshooting Postgres Sources | Airbyte Documentation
>>      <https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#under-cdc-incremental-mode-there-are-still-full-refresh-syncs>
>>    - https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error
>>
>>
>>
>> Thank you very much for your time and assistance.
>>
>> Thanks,
>>
>> Kaushik Iska
>>
>>
It might be interesting to see the contents of pg_replication_slots.
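
When comparing the slot's restart_lsn and confirmed_flush_lsn against the
segment named in the error, it can help to map an LSN to its WAL file name.
Below is a rough sketch of that mapping, not something taken from the setup
described above. It assumes the default 16 MB wal_segment_size and timeline 1
(both are assumptions, since Cloud SQL may be configured differently), and the
helper names parse_lsn and wal_segment_name are made up for illustration.

```python
# Assumption: default wal_segment_size of 16 MB; adjust if the server differs.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'X/Y' (two hex numbers) to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL file name that contains the given LSN."""
    segno = parse_lsn(lsn) // segment_size
    # Number of segments per 4 GB "xlogid" (256 with 16 MB segments).
    segs_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


if __name__ == "__main__":
    # LSN passed to START_REPLICATION in the quoted log.
    print(wal_segment_name("6/5AE67D79"))  # 00000001000000060000005A
    # Consistent point reported by the successful attempt.
    print(wal_segment_name("6/5A31F050"))  # 00000001000000060000005A
```

Under those assumptions, both LSNs in the quoted log fall in segment
00000001000000060000005A, while the error names the previous segment,
000000010000000600000059. That would be consistent with decoding starting
from the slot's restart_lsn rather than from the LSN passed to
START_REPLICATION, which is why the slot contents would be useful to see.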
