[jira] [Resolved] (HBASE-29987) Replication position corruption when WAL file switch detected in ReplicationSourceWALReader run loop

Duo Zhang (Jira) Thu, 12 Mar 2026 21:10:05 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-29987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Duo Zhang resolved HBASE-29987.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.5
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to branch-2.6+.

Thanks [~skhillon]!

> Replication position corruption when WAL file switch detected in 
> ReplicationSourceWALReader run loop
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29987
>                 URL: https://issues.apache.org/jira/browse/HBASE-29987
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal, Zookeeper
>            Reporter: Sid Khillon
>            Assignee: Sid Khillon
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> When {{ReplicationSourceWALReader.run()}} detects a WAL file switch via the 
> {{switched()}} check at line 160, it enqueues an EOF batch but does not 
> update {{currentPosition}}. If the outer loop subsequently restarts 
> (e.g., due to {{WALEntryFilterRetryableException}}), the new 
> {{WALEntryStream}} is created with the stale position from the old WAL file, 
> which gets applied to the new WAL file. This causes the reader to enter an 
> infinite retry loop attempting to seek to an invalid position, permanently 
> stalling replication.
>  
> The {{switched()}} path at line 160 fires when {{readWALEntries()}} returns a 
> batch without seeing EOF — either because batch capacity was reached, or 
> because an error (e.g., NameNode timeout) caused {{hasNext()}} inside 
> {{readWALEntries()}} to return RETRY, breaking the loop early. The next 
> {{hasNext()}} at line 153 then detects EOF, dequeues the old file, and 
> returns {{RETRY_IMMEDIATELY}}. The {{switched()}} check fires because 
> {{currentPath}} (captured before {{hasNext()}}) was the old file, but 
> the stream's path is now null after the dequeue. {{currentPosition}} is not 
> updated.
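
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the HBase code and not the committed patch: the class StalePositionSketch, its Wal and ReaderState types, the file names, and the byte sizes are all hypothetical, and resetting the position to 0 on a switch is one plausible remedy chosen for illustration. The actual fix may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the stale-position hazard. None of these types are HBase classes. */
public class StalePositionSketch {

  /** Hypothetical stand-in for a WAL file: a name and a length in bytes. */
  static final class Wal {
    final String name;
    final long length;

    Wal(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  /** Hypothetical reader state: the file being read and the offset within it. */
  static final class ReaderState {
    String currentPath;
    long currentPosition;
  }

  /** In this model a seek is valid only if the offset lies inside the file. */
  static boolean canSeek(Wal wal, long position) {
    return position >= 0 && position <= wal.length;
  }

  /**
   * Simulates: read most of the old file, detect the file switch, then restart
   * the stream on the head of the queue using the saved position.
   *
   * @param resetOnSwitch whether the switch path also updates the saved position
   * @return true if the restarted stream can seek to the saved position
   */
  static boolean restartAfterSwitch(boolean resetOnSwitch) {
    Deque<Wal> queue = new ArrayDeque<>();
    queue.add(new Wal("wal.1", 4096)); // old file (sizes are made up)
    queue.add(new Wal("wal.2", 128));  // new, much shorter file

    ReaderState state = new ReaderState();
    state.currentPath = queue.peek().name;
    state.currentPosition = 4000; // offset reached in the old file

    // File switch detected: the old file is dequeued.
    queue.remove();
    if (resetOnSwitch) {
      // Assumed remedy: point the saved state at the start of the new file.
      state.currentPath = queue.peek().name;
      state.currentPosition = 0;
    }

    // Outer loop restarts: a new stream opens the queue head at the saved position.
    Wal head = queue.peek();
    return canSeek(head, state.currentPosition);
  }

  public static void main(String[] args) {
    System.out.println("without reset: seek ok = " + restartAfterSwitch(false));
    System.out.println("with reset:    seek ok = " + restartAfterSwitch(true));
  }
}
```

Without the reset, the offset 4000 saved for wal.1 is applied to the 128-byte wal.2, so every restart fails the same seek. This mirrors the infinite retry loop in the report.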



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
