[ 
https://issues.apache.org/jira/browse/HDFS-17886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng resolved HDFS-17886.
-------------------------------
    Fix Version/s: 3.6.0
       Resolution: Fixed

> Fix NameNode storage directory errors caused by doCheckpoint's updateStorageVersion 
> failing when the doCheckpoint thread is interrupted during a standby-to-active 
> HA failover
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17886
>                 URL: https://issues.apache.org/jira/browse/HDFS-17886
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: liuguanghua
>            Assignee: liuguanghua
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>
> When a NameNode HA failover occurs and the standby NameNode transitions to 
> active, it interrupts the doCheckpoint thread. With an extremely small 
> probability, updateStorageVersion() in doCheckpoint then throws 
> java.nio.channels.ClosedByInterruptException, which marks the storage 
> directory as failed and removes it from the available list.
>  
> The relevant error log is as follows:
> 2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage: 
> Error during write properties to the VERSION file to Storage Directory root= 
> /data/hadoop/hdfs/namenode; location= null
> java.nio.channels.ClosedByInterruptException
>         at 
> java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
>         at 
> java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
>         at 
> java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
>         at 
> org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
>         at 
> org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
>         at 
> org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
> 2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage: 
> Error reported on storage directory Storage Directory root= 
> /data/hadoop/hdfs/namenode; location= null
> 2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage: 
> About to remove corresponding storage: /data/hadoop/hdfs/namenode
> 2026-01-29 20:13:38,245 ERROR 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in 
> doCheckpoint
> java.io.IOException: All the storage failed while writing properties to 
> VERSION file
>         at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
>  
> Since java.nio.channels.ClosedByInterruptException does not indicate a disk 
> error, the storage directory should not be removed from the available list.
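The idea behind the fix can be sketched as follows. This is a hypothetical, simplified illustration (not the actual HDFS patch): the class name, method, and `Outcome` enum are invented for the example. The point is that when writing the VERSION file fails, only a genuine disk error should mark the storage directory as failed; a `ClosedByInterruptException` merely means the checkpointer thread was interrupted (e.g. by an HA failover) while the directory itself is healthy, so the interrupt status is restored and the directory is kept.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;

// Hypothetical sketch, not the actual HDFS-17886 patch: distinguish a
// thread-interrupt during a VERSION file write from a real disk failure.
public class VersionWriteSketch {

    /** Simulated result of attempting to write one storage directory. */
    enum Outcome { OK, DISK_ERROR, INTERRUPTED }

    static Outcome writeVersion(boolean simulateInterrupt,
                                boolean simulateDiskError) {
        try {
            if (simulateInterrupt) {
                // FileChannel operations throw this when the calling thread
                // is interrupted mid-I/O (as during an HA failover).
                throw new ClosedByInterruptException();
            }
            if (simulateDiskError) {
                throw new IOException("disk failure");
            }
            return Outcome.OK;
        } catch (ClosedByInterruptException e) {
            // Not a disk error: restore the interrupt status and keep the
            // storage directory on the available list.
            Thread.currentThread().interrupt();
            return Outcome.INTERRUPTED;
        } catch (IOException e) {
            // Genuine I/O failure: the caller may remove this directory
            // from the available storage list.
            return Outcome.DISK_ERROR;
        }
    }
}
```

Note that `ClosedByInterruptException` extends `IOException`, so it must be caught before the generic `IOException` handler; catching them in the opposite order would silently treat every interrupt as a disk failure, which is exactly the reported bug.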



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
