[ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang resolved HDFS-16111.
------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Thanks!

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16111
>                 URL: https://issues.apache.org/jira/browse/HDFS-16111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Zhihai Xu
>            Assignee: Zhihai Xu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When we upgraded our Hadoop cluster from Hadoop 2.6.0 to Hadoop 3.2.2, we got failed volumes on a lot of datanodes, which caused some missing blocks at that time. Although we later recovered all the missing blocks by symlinking the path (dfs/dn/current) on the failed volume to a new directory and copying all the data to the new directory, we missed our SLA and it delayed the upgrade of our production cluster by several hours.
>
> When this issue happened, we saw a lot of exceptions like the following before the volume failed on the datanode:
>
> [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block BP-XXXXXX-XX.XX.XX.XX-XXXXXX:blk_XXXXX_XXXXXXX]] datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor :Possible disk error: Failed to create /XXXXXXX/dfs/dn/current/BP-XXXXXX-XX.XX.XX.XX-XXXXXXXXX/tmp/blk_XXXXXX. Cause is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>
> We found that this issue happened for two reasons.
>
> First, the upgrade process uses some extra disk space on each disk volume of the datanode. BlockPoolSliceStorage.doUpgrade (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) is the main upgrade function in the datanode, and it adds some extra storage: all the new directories created under /current/<bpid>/current. Although all block data files and block metadata files are hard-linked with /current/<bpid>/previous after the upgrade, the large number of newly created directories still uses some disk space on each disk volume.
>
> Second, there is a potential bug when picking a disk volume to write a new block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy, and the volume-selection code only checks whether the available space on the selected disk is larger than the size of the block file to store (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). But creating a new block actually creates two files: the block file blk_XXXX and the block metadata file blk_XXXX_XXXX.meta; when a block is finalized, both the block file size and the metadata file size are updated (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391). The current code considers only the block file size and ignores the block metadata file size when choosing a disk in RoundRobinVolumeChoosingPolicy. In addition, many blocks can be received at the same time (the default maximum number of DataXceiver threads is 4096). Together this underestimates the total space needed to write a block, which can cause the disk-full error above (No space left on device).
>
> Since the size of the block metadata file is not fixed, I suggest adding a configuration (dfs.datanode.round-robin-volume-choosing-policy.additional-available-space) to safeguard the disk space when choosing a volume to write new block data in RoundRobinVolumeChoosingPolicy. The default value can be 0 for backward compatibility.
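For illustration, here is a minimal, self-contained Java sketch of the idea behind the proposed setting. It is not the actual patch: the Volume interface and RoundRobinChooser class below are simplified stand-ins for FsVolumeSpi and RoundRobinVolumeChoosingPolicy, and they only show how a configurable amount of headroom could be folded into the round-robin space check so that the meta file and other in-flight writers are accounted for.

    import java.io.IOException;
    import java.util.List;

    // Simplified stand-in for FsVolumeSpi: only the free-space query matters here.
    interface Volume {
      long getAvailable() throws IOException;   // free bytes on this volume
    }

    class RoundRobinChooser<V extends Volume> {
      // Headroom in bytes, mirroring the proposed
      // dfs.datanode.round-robin-volume-choosing-policy.additional-available-space key.
      // A value of 0 keeps the current behaviour: only the block size is checked.
      private final long additionalAvailableSpace;
      private int curVolume = 0;

      RoundRobinChooser(long additionalAvailableSpace) {
        this.additionalAvailableSpace = additionalAvailableSpace;
      }

      synchronized V chooseVolume(List<V> volumes, long blockSize) throws IOException {
        if (volumes.isEmpty()) {
          throw new IOException("No volumes available");
        }
        if (curVolume >= volumes.size()) {
          curVolume = 0;                          // the volume list may have shrunk
        }
        final int startVolume = curVolume;
        while (true) {
          V volume = volumes.get(curVolume);
          curVolume = (curVolume + 1) % volumes.size();
          // Require headroom beyond the block itself so the blk_XXXX_XXXX.meta file and
          // other concurrent writers cannot push the disk to "No space left on device".
          if (volume.getAvailable() >= blockSize + additionalAvailableSpace) {
            return volume;
          }
          if (curVolume == startVolume) {
            throw new IOException("Out of space: no volume has at least "
                + (blockSize + additionalAvailableSpace) + " bytes available");
          }
        }
      }
    }

In the committed change the headroom would presumably be read from the configuration key above, defaulting to 0 so existing deployments keep the current behaviour.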
--
This message was sent by Atlassian Jira
(v8.3.4#803005)


---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org