He Xiaoqiao created HDFS-14527:
----------------------------------

             Summary: Stop all DataNodes may result in NN terminate
                 Key: HDFS-14527
                 URL: https://issues.apache.org/jira/browse/HDFS-14527
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: He Xiaoqiao
            Assignee: He Xiaoqiao

If all DataNodes in a cluster are stopped, BlockPlacementPolicyDefault#chooseTarget may hit an ArithmeticException when calling #getMaxNodesPerRack. The runtime exception propagates to BlockManager's ReplicationMonitor thread, which then terminates the NN.

The root cause is that BlockPlacementPolicyDefault#chooseTarget does not hold the global lock. If all DataNodes die between the {{clusterMap.getNumOfLeaves()}} check and the call to {{getMaxNodesPerRack}}, then {{getMaxNodesPerRack}} throws {{ArithmeticException}}.

{code:java}
private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
    Node writer,
    List<DatanodeStorageInfo> chosenStorage,
    boolean returnChosenNodes,
    Set<Node> excludedNodes,
    long blocksize,
    final BlockStoragePolicy storagePolicy,
    EnumSet<AddBlockFlag> addBlockFlags,
    EnumMap<StorageType, Integer> sTypes) {
  // This emptiness check runs without the global lock.
  if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
    return DatanodeStorageInfo.EMPTY_ARRAY;
  }
  ......
  // If all DataNodes die after the check above, this call divides by zero.
  int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
  ......
}
{code}

The detailed log is shown below.

{code:java}
2019-05-31 12:29:21,803 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
java.lang.ArithmeticException: / by zero
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
    at java.lang.Thread.run(Thread.java:745)
2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}

To be honest, this is not a serious bug and it is not easy to reproduce: if we stop all DataNodes and keep only the NameNode alive, HDFS cannot offer normal service anyway, and only directory information can be retrieved. It may be a corner case.
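
For illustration only, the check-then-act race described above can be modeled with a small standalone Java sketch. This is not the Hadoop code: the class {{RackLimitSketch}}, the shared counter {{numOfRacks}}, both method names, the {{-1}} sentinel, and the formula are hypothetical stand-ins chosen to show how a count that is non-zero at check time can be zero by the time it is used as a divisor, and how reading the count once and guarding it avoids the exception.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the race; NOT the Hadoop implementation. All names are hypothetical. */
public class RackLimitSketch {
    /** Stand-in for the live cluster topology; another thread may change it at any time. */
    static final AtomicInteger numOfRacks = new AtomicInteger(2);

    /** Unguarded: re-reads the live count and divides by it (illustrative formula). */
    static int unguardedMaxNodesPerRack(int totalReplicas) {
        return (totalReplicas - 1) / numOfRacks.get() + 2; // throws if the count is 0
    }

    /** Guarded: takes one snapshot and returns -1 (hypothetical sentinel) if it is 0. */
    static int guardedMaxNodesPerRack(int totalReplicas) {
        int racks = numOfRacks.get(); // single read, so check and use see the same value
        if (racks == 0) {
            return -1; // nothing to place on; caller should skip this round
        }
        return (totalReplicas - 1) / racks + 2;
    }

    public static void main(String[] args) {
        // Step 1: the caller's emptiness check passes while nodes are still alive.
        if (numOfRacks.get() == 0) {
            return;
        }
        // Step 2: simulate every DataNode dying between the check and the use.
        numOfRacks.set(0);
        // Step 3: the unguarded computation now divides by zero.
        try {
            unguardedMaxNodesPerRack(3);
        } catch (ArithmeticException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        // The guarded variant survives the same interleaving.
        System.out.println("guarded: " + guardedMaxNodesPerRack(3));
    }
}
```

In this sketch the guard lives in the callee, so it holds no matter which caller skipped the lock. Whether the actual fix should guard the divisor, take the lock, or catch the exception in the monitor thread is a design decision for the patch and is not implied here.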