LiuGuH opened a new pull request, #8277:
URL: https://github.com/apache/hadoop/pull/8277
HDFS-17886. Fix NameNode storage directory errors when doCheckpoint's
updateStorageVersion fails because the doCheckpoint thread is interrupted
during a standby-to-active HA failover
### Description of PR
As described in
[HDFS-17886](https://issues.apache.org/jira/browse/HDFS-17886):
when a NameNode HA failover occurs and the standby NameNode transitions to
active, the doCheckpoint thread is interrupted. There is a very small
probability that updateStorageVersion() in doCheckpoint then throws
java.nio.channels.ClosedByInterruptException, which causes the storage
directory to be marked as failed and removed from the available list.
The relevant error log is as follows:
```
2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage: Error during write properties to the VERSION file to Storage Directory root= /data/hadoop/hdfs/namenode; location= null
java.nio.channels.ClosedByInterruptException
	at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
	at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
	at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
	at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
	at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
	at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
	at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory root= /data/hadoop/hdfs/namenode; location= null
2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /data/hadoop/hdfs/namenode
2026-01-29 20:13:38,245 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.io.IOException: All the storage failed while writing properties to VERSION file
	at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
```
Since java.nio.channels.ClosedByInterruptException indicates a thread
interruption rather than a disk error, the storage directory should not be
removed from the available storage list in this case.
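The distinction above can be sketched as a small helper that walks an
exception's cause chain and decides whether the failure was interrupt-induced.
This is only an illustration of the idea, not the actual Hadoop patch; the
class and method names here are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;

public class StorageFailureCheck {

    /**
     * Hypothetical helper: returns true if the given throwable, or any
     * exception in its cause chain, is a ClosedByInterruptException,
     * i.e. the write failed because the thread was interrupted (as during
     * an HA failover) rather than because of a real disk problem.
     */
    static boolean isInterruptFailure(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof ClosedByInterruptException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Interrupt-induced failure: should NOT mark the directory as failed.
        IOException interrupted = new IOException(new ClosedByInterruptException());
        // Ordinary I/O failure: should still count as a disk error.
        IOException diskError = new IOException("No space left on device");

        System.out.println(isInterruptFailure(interrupted)); // true
        System.out.println(isInterruptFailure(diskError));   // false
    }
}
```

With such a check in place, the error-reporting path could skip removing the
storage directory when the checkpoint thread was simply interrupted.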
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]