liuguanghua created HDFS-17886:
----------------------------------

             Summary: Fix NameNode storage directory errors when doCheckpoint's 
updateStorageVersion fails because the doCheckpoint thread is interrupted during 
a standby-to-active HA failover
                 Key: HDFS-17886
                 URL: https://issues.apache.org/jira/browse/HDFS-17886
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: liuguanghua
            Assignee: liuguanghua


When a NameNode HA failover occurs, the standby NameNode transitions to active 
and interrupts the doCheckpoint thread. With an extremely small probability, 
updateStorageVersion() in doCheckpoint then throws 
java.nio.channels.ClosedByInterruptException. This is treated as a storage 
directory error, and the directory is removed from the available list.
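The interrupt-to-exception mechanism can be reproduced in isolation. A minimal standalone sketch (not Hadoop code; the file name and class name are illustrative): when a thread's interrupt status is set and it enters a blocking FileChannel operation such as position(), AbstractInterruptibleChannel closes the channel and throws ClosedByInterruptException, exactly as in the stack trace below.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("version", ".tmp");
        final Throwable[] caught = new Throwable[1];
        Thread writer = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                // Set our own interrupt status before the blocking channel call,
                // mimicking the HA failover interrupting the checkpoint thread.
                Thread.currentThread().interrupt();
                ch.position(0); // interruptible op -> ClosedByInterruptException
                ch.write(ByteBuffer.wrap("layoutVersion=-66".getBytes()));
            } catch (IOException e) {
                caught[0] = e; // the channel op fails even though the disk is fine
            }
        });
        writer.start();
        writer.join();
        System.out.println(caught[0] instanceof ClosedByInterruptException);
        Files.deleteIfExists(tmp);
    }
}
```

Nothing here touched a faulty disk; the exception reflects only the thread's interrupt status at the moment of the channel call.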


The relevant error log is as follows:

2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage: Error during write properties to the VERSION file to Storage Directory root= /data/hadoop/hdfs/namenode; location= null
java.nio.channels.ClosedByInterruptException
        at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
        at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
        at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
        at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
        at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
        at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory root= /data/hadoop/hdfs/namenode; location= null
2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /data/hadoop/hdfs/namenode
2026-01-29 20:13:38,245 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.io.IOException: All the storage failed while writing properties to VERSION file
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)


java.nio.channels.ClosedByInterruptException does not indicate a disk error, so 
the storage directory should not be removed from the available storage list.
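One possible direction for the fix, sketched as standalone code rather than actual Hadoop internals (the class and method names here are hypothetical, not from the HDFS codebase): classify the IOException before reporting a storage failure, and treat thread interruption as a non-disk error.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;

/**
 * Minimal sketch (not Hadoop code): decide whether an IOException raised
 * while writing the VERSION file is evidence of a bad storage directory.
 * ClosedByInterruptException only means the writing thread was interrupted
 * (e.g. by an HA state transition), so the directory should stay available.
 */
public class StorageErrorClassifier {
    static boolean isDiskError(IOException e) {
        // Interruption of the checkpoint thread is not a disk fault.
        return !(e instanceof ClosedByInterruptException);
    }

    public static void main(String[] args) {
        // Interrupted write: do not remove the storage directory.
        System.out.println(isDiskError(new ClosedByInterruptException()));
        // Genuine I/O failure: report it as before.
        System.out.println(isDiskError(new IOException("Input/output error")));
    }
}
```

With a check like this in the error-reporting path, an interrupted checkpoint would propagate the interruption instead of marking the directory failed.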



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
