liuguanghua created HDFS-17886:
----------------------------------
Summary: Fix NameNode storage directory errors when updateStorageVersion()
in doCheckpoint fails because the doCheckpoint thread is interrupted during
standby-to-active HA failover
Key: HDFS-17886
URL: https://issues.apache.org/jira/browse/HDFS-17886
Project: Hadoop HDFS
Issue Type: Bug
Reporter: liuguanghua
Assignee: liuguanghua
When a NameNode HA failover occurs and the standby NameNode transitions to
active, it interrupts the doCheckpoint thread. With extremely small
probability, the interrupt arrives while doCheckpoint is inside
updateStorageVersion(), which then throws
java.nio.channels.ClosedByInterruptException. As a result, the storage
directory is reported as failed and removed from the available list.
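
The mechanism can be reproduced outside Hadoop. The following is a minimal sketch, not Hadoop code: the class name, method name, and temporary file are made up for illustration. It relies on the documented behaviour of interruptible channels: if a thread's interrupt status is already set when it invokes a blocking I/O operation on a FileChannel, the channel is closed and the thread receives ClosedByInterruptException, even though the disk is healthy.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; it does not exist in Hadoop.
class InterruptedWriteDemo {

  /**
   * Writes to a temp file while the current thread's interrupt flag is set.
   * Returns true if the write failed with ClosedByInterruptException and
   * the channel was closed as a side effect.
   */
  static boolean writeFailsWhenInterrupted() throws IOException {
    Path file = Files.createTempFile("VERSION", ".tmp");
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
      // Simulate the interrupt delivered to the checkpointer thread.
      Thread.currentThread().interrupt();
      try {
        // Placeholder payload, not a real VERSION file entry.
        channel.write(ByteBuffer.wrap("key=value\n".getBytes(StandardCharsets.UTF_8)));
        return false; // the write unexpectedly succeeded
      } catch (ClosedByInterruptException e) {
        // The file system is fine; the channel was closed by the interrupt.
        return !channel.isOpen();
      } finally {
        // Clear the interrupt flag so the cleanup below is not affected.
        Thread.interrupted();
      }
    } finally {
      Files.deleteIfExists(file);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("interrupted write failed: " + writeFailsWhenInterrupted());
  }
}
```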
The relevant error log is as follows:
2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage: Error during write properties to the VERSION file to Storage Directory root= /data/hadoop/hdfs/namenode; location= null
java.nio.channels.ClosedByInterruptException
    at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
    at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
    at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
    at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
    at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
    at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory root= /data/hadoop/hdfs/namenode; location= null
2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /data/hadoop/hdfs/namenode
2026-01-29 20:13:38,245 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.io.IOException: All the storage failed while writing properties to VERSION file
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
java.nio.channels.ClosedByInterruptException does not indicate a disk error,
so the storage directory should not be removed from the available storage list.
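
One possible shape for such handling is sketched below. This is only an illustration under stated assumptions, not the actual Hadoop code or the eventual patch: StorageWriteSketch, VersionWriter, and this writeAll signature are hypothetical stand-ins, and directories are represented as plain strings. The idea is to treat ClosedByInterruptException separately from other IOExceptions, so that an interrupt aborts the operation without marking any directory as failed.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these names do not come from the Hadoop code base.
class StorageWriteSketch {

  /** Stand-in for writing the VERSION file of one storage directory. */
  interface VersionWriter {
    void write(String dir) throws IOException;
  }

  /**
   * Writes to every directory and returns the ones that failed with a
   * genuine I/O error. A ClosedByInterruptException is rethrown to the
   * caller and the directory it occurred on is not marked as failed.
   */
  static List<String> writeAll(List<String> dirs, VersionWriter writer)
      throws IOException {
    List<String> failed = new ArrayList<>();
    for (String dir : dirs) {
      try {
        writer.write(dir);
      } catch (ClosedByInterruptException e) {
        // The thread was interrupted; this says nothing about the disk.
        throw e;
      } catch (IOException e) {
        // Any other I/O failure is treated as a problem with this directory.
        failed.add(dir);
      }
    }
    return failed;
  }

  public static void main(String[] args) throws IOException {
    List<String> dirs = Arrays.asList("/data/a", "/data/b");
    List<String> failed = writeAll(dirs, dir -> {
      if (dir.equals("/data/b")) {
        throw new IOException("simulated disk error");
      }
    });
    System.out.println("failed directories: " + failed);
  }
}
```

Whether the interrupt should be rethrown, retried, or handled at a different layer is a design decision for the actual fix; the sketch only shows the distinction between the two failure classes.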