[
https://issues.apache.org/jira/browse/HBASE-29987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang resolved HBASE-29987.
-------------------------------
Fix Version/s: 2.7.0, 3.0.0-beta-2, 2.6.5
Hadoop Flags: Reviewed
Resolution: Fixed
Pushed to branch-2.6+.
Thanks [~skhillon]!
> Replication position corruption when WAL file switch detected in
> ReplicationSourceWALReader run loop
> ----------------------------------------------------------------------------------------------------
>
> Key: HBASE-29987
> URL: https://issues.apache.org/jira/browse/HBASE-29987
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal, Zookeeper
> Reporter: Sid Khillon
> Assignee: Sid Khillon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> When {{ReplicationSourceWALReader.run()}} detects a WAL file switch via the
> {{switched()}} check at line 160, it enqueues an EOF batch but does not
> update {{currentPosition}}. If the outer loop subsequently restarts
> (e.g., due to {{WALEntryFilterRetryableException}}), the new
> {{WALEntryStream}} is created with the stale position from the old WAL file,
> which gets applied to the new WAL file. This causes the reader to enter an
> infinite retry loop attempting to seek to an invalid position, permanently
> stalling replication.
>
> The {{switched()}} path at line 160 fires when {{readWALEntries()}} returns a
> batch without seeing EOF — either because batch capacity was reached, or
> because an error (e.g., NameNode timeout) caused {{hasNext()}} inside
> {{readWALEntries()}} to return RETRY, breaking the loop early. The next
> {{hasNext()}} at line 153 then detects EOF, dequeues the old file, and
> returns {{RETRY_IMMEDIATELY}}. The {{switched()}} check fires because
> {{currentPath}} (captured before {{hasNext()}}) was the old file, but
> the stream’s path is now null after the dequeue. {{currentPosition}} is not
> updated.
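
The stale-position hazard described above can be illustrated with a small standalone model. This is only a sketch, not the HBase code or the committed patch: the class {{WalSwitchSketch}}, the method {{positionAfterSwitch}}, the {{resetOnSwitch}} flag, and the file names are all hypothetical, and resetting the position to 0 on a switch is an assumption about what a fix could look like.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical model of the reader loop's bookkeeping; not HBase source.
class WalSwitchSketch {

    /**
     * Models one loop pass that ends in a WAL file switch and returns the
     * position a restarted reader would apply to whatever file is read next.
     */
    static long positionAfterSwitch(Deque<String> walQueue, long positionInOldWal,
            boolean resetOnSwitch) {
        String pathBefore = walQueue.peek();  // captured before the EOF check
        long currentPosition = positionInOldWal;

        walQueue.poll();                      // EOF detected: the old file is dequeued
        String pathAfter = walQueue.peek();   // null when no newer file is queued

        boolean switched = pathBefore != null && !pathBefore.equals(pathAfter);
        if (switched && resetOnSwitch) {
            // Assumption: a fix restarts the next file from its beginning.
            currentPosition = 0L;
        }
        return currentPosition;
    }

    public static void main(String[] args) {
        Deque<String> buggy = new ArrayDeque<>(Arrays.asList("wal.old", "wal.new"));
        // Without the reset, the old file's offset leaks into the new file.
        System.out.println(positionAfterSwitch(buggy, 4096L, false)); // prints 4096

        Deque<String> patched = new ArrayDeque<>(Arrays.asList("wal.old", "wal.new"));
        System.out.println(positionAfterSwitch(patched, 4096L, true)); // prints 0
    }
}
```

In the first call the offset 4096 from the old file survives the switch, which is the condition under which a restarted reader would try to seek to a position that may not exist in the new file.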
--
This message was sent by Atlassian Jira
(v8.20.10#820010)