[
https://issues.apache.org/jira/browse/HDFS-17886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siyao Meng resolved HDFS-17886.
-------------------------------
Fix Version/s: 3.6.0
Resolution: Fixed
> Fix NameNode storage directory errors when doCheckpoint updateStorageVersion
> fails because the doCheckpoint thread is interrupted during standby-to-active
> HA failover
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-17886
> URL: https://issues.apache.org/jira/browse/HDFS-17886
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: liuguanghua
> Assignee: liuguanghua
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.6.0
>
>
> When a NameNode HA failover occurs and the standby NameNode transitions to
> active, the doCheckpoint thread is interrupted. There is an extremely small
> probability that updateStorageVersion() inside doCheckpoint will then throw
> java.nio.channels.ClosedByInterruptException. This marks the storage
> directory as failed and removes it from the available list.
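> The mechanism is reproducible outside Hadoop: java.nio FileChannel is an
> interruptible channel, so a blocking channel operation performed while the
> thread's interrupt status is set closes the channel and raises
> ClosedByInterruptException. A self-contained, JDK-only sketch (the file name
> and written bytes are illustrative, not Hadoop's actual VERSION contents):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptDemo {
    /**
     * Attempts a FileChannel write with the writer thread's interrupt status
     * already set, mimicking the checkpointer thread being interrupted during
     * an HA failover. Returns the simple name of the IOException raised.
     */
    static String writeWhileInterrupted(Path file) throws InterruptedException {
        final String[] caught = {"none"};
        Thread writer = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                // Simulate the failover interrupt arriving before the write.
                Thread.currentThread().interrupt();
                // FileChannel is interruptible: the pending interrupt closes
                // the channel, and the operation fails with
                // ClosedByInterruptException instead of writing.
                ch.write(ByteBuffer.wrap("layoutVersion=-66\n"
                        .getBytes(StandardCharsets.UTF_8)));
            } catch (IOException e) {
                caught[0] = e.getClass().getSimpleName();
            }
        });
        writer.start();
        writer.join();
        return caught[0];
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("VERSION", ".tmp");
        System.out.println(writeWhileInterrupted(tmp)); // ClosedByInterruptException
        Files.deleteIfExists(tmp);
    }
}
```

> Note the channel itself is healthy here; the exception reflects only the
> caller thread's interrupt status, which is why it is so rare in practice.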
>
> The relevant error log is as follows:
> 2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage:
> Error during write properties to the VERSION file to Storage Directory root=
> /data/hadoop/hdfs/namenode; location= null
> java.nio.channels.ClosedByInterruptException
> at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
> at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
> at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
> at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
> at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
> at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
> 2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage:
> Error reported on storage directory Storage Directory root=
> /data/hadoop/hdfs/namenode; location= null
> 2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage:
> About to remove corresponding storage: /data/hadoop/hdfs/namenode
> 2026-01-29 20:13:38,245 ERROR
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in
> doCheckpoint
> java.io.IOException: All the storage failed while writing properties to
> VERSION file
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
>
> Since java.nio.channels.ClosedByInterruptException signals a thread
> interrupt rather than a disk error, the storage directory should not be
> removed from the available storage list.
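> The direction of the fix can be sketched with plain JDK types: a hypothetical
> writeAll-style loop that rethrows ClosedByInterruptException to its caller
> instead of counting the directory as failed. StorageDir and the method names
> below are illustrative stand-ins for this sketch, not the actual NNStorage
> API:

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.util.Iterator;
import java.util.List;

public class WriteAllSketch {
    /** Illustrative stand-in for a storage directory; not the Hadoop class. */
    interface StorageDir {
        void writeProperties() throws IOException;
    }

    /**
     * Hypothetical writeAll-style loop. An ordinary IOException marks the
     * directory as failed and drops it from the available list (today's
     * behavior); ClosedByInterruptException is propagated instead, because a
     * thread interrupt says nothing about the health of the disk.
     */
    static void writeAll(List<StorageDir> dirs) throws IOException {
        for (Iterator<StorageDir> it = dirs.iterator(); it.hasNext(); ) {
            StorageDir sd = it.next();
            try {
                sd.writeProperties();
            } catch (ClosedByInterruptException e) {
                throw e; // interrupt, not a disk error: keep the directory
            } catch (IOException e) {
                it.remove(); // genuine disk error: remove from available list
            }
        }
        if (dirs.isEmpty()) {
            throw new IOException(
                "All the storage failed while writing properties to VERSION file");
        }
    }
}
```

> With this split, a failover-time interrupt surfaces in doCheckpoint as an
> exception without shrinking the set of healthy storage directories.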
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]