[jira] [Commented] (HDFS-17821) Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of the multiple NNs

ASF GitHub Bot (Jira) Fri, 29 Aug 2025 02:25:11 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-17821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016980#comment-18016980
 ]


ASF GitHub Bot commented on HDFS-17821:
---------------------------------------

lfxy commented on PR #7876:
URL: https://github.com/apache/hadoop/pull/7876#issuecomment-3236362394

   @tomscut @Hexiaoqiao Hi, could you help review this issue?
   Currently, in a cluster with observers, the SNN needs to upload the fsimage 
to multiple NNs. If one of the uploads fails, the SNN will repeatedly checkpoint 
and fail because lastCheckpointTime is inconsistent across the NNs. Thank you.




> Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of 
> the multiple NNs
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17821
>                 URL: https://issues.apache.org/jira/browse/HDFS-17821
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.5.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Major
>              Labels: pull-request-available
>
> In our cluster with observer NNs, when the standby NN performs a checkpoint 
> and sends the fsimage to the other NNs, if the transfer to one NN fails due 
> to network anomalies, an NN restart, or another exception, the standby 
> considers the checkpoint failed, does not update its lastCheckpointTime, 
> and retries the checkpoint. 
> However, the active or observer NNs that successfully received the fsimage 
> have already updated their lastCheckpointTime, while the NN that failed to 
> receive it has not, so lastCheckpointTime becomes inconsistent across the 
> NNs. Subsequent checkpoints then repeatedly fail to send the fsimage to 
> some or all of the active and observer NNs, because those NNs do not yet 
> satisfy the DFS_NAMENODE_CHECKPOINT_PERIOD_KEY condition. As a result, the 
> SNN keeps failing the checkpoint and retrying. I think the SNN should 
> consider the checkpoint successful and update its lastCheckpointTime if the 
> fsimage transfer succeeds on at least half of the NNs.
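
The "at least half" rule proposed in the description can be sketched as 
follows. This is only an illustration under stated assumptions, not the code 
in the PR: the class CheckpointQuorum and both method names are hypothetical, 
and the per-NN upload results are modeled as a plain list of booleans.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop. */
public final class CheckpointQuorum {
    private CheckpointQuorum() {}

    /**
     * Returns true when the fsimage upload succeeded on at least half of the
     * target NNs, the threshold suggested in the issue description.
     */
    public static boolean uploadedToAtLeastHalf(List<Boolean> uploadResults) {
        if (uploadResults.isEmpty()) {
            // Assumption: with no upload targets, do not report success.
            return false;
        }
        long successes = uploadResults.stream()
                .filter(Boolean::booleanValue)
                .count();
        // successes >= total / 2, written so integer division cannot round down.
        return successes * 2 >= uploadResults.size();
    }

    /**
     * Returns the value the standby would keep as lastCheckpointTime:
     * "now" if enough uploads succeeded, otherwise the previous value.
     */
    public static long nextLastCheckpointTime(long previous, long now,
                                              List<Boolean> uploadResults) {
        return uploadedToAtLeastHalf(uploadResults) ? now : previous;
    }
}
```

With two of three uploads succeeding, nextLastCheckpointTime returns the new 
timestamp, so the standby would wait for the next checkpoint period instead 
of retrying immediately against NNs that have already accepted the image.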



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
