[ https://issues.apache.org/jira/browse/HDFS-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved HDFS-130.
-----------------------------------
    Resolution: Not a Problem

> high rate of task failures because of bad or full datanodes
> -----------------------------------------------------------
>
>                 Key: HDFS-130
>                 URL: https://issues.apache.org/jira/browse/HDFS-130
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Christian Kunz
>
> With 0.17 we notice a high rate of task failures because the same bad
> datanodes are reported repeatedly as firstBadLink. We never saw this in 0.16.
> After running fewer than 20,000 map tasks, more than 2,500 of them reported
> one particular datanode as firstBadLink, with a typical exception of the form:
>
> 08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
> 08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
> 08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
> 08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
> 08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> 08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block blk_4847638219960634858 bad datanode[2]
> Exception in thread "main" java.io.IOException: Could not get block locations. Aborting...
>
> With several such bad datanodes, the probability of a job failing rises
> considerably.
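
For anyone trying to confirm which datanode is responsible in a similar situation, one rough way to do it is to tally the firstBadLink reports in the client logs by address. The following is only a sketch, not part of HDFS or of this issue: the function names (`count_first_bad_links`, `min_failure_fraction`) and the `SAMPLE` list are hypothetical, and the pattern assumes the "Bad connect ack with firstBadLink <host:port>" wording shown in the quoted log. The second helper just restates the report's figures (more than 2,500 reports out of fewer than 20,000 map tasks) as a lower bound on the affected fraction.

```python
import re
from collections import Counter

# Assumed log wording, taken from the quoted excerpt:
#   "... Bad connect ack with firstBadLink <host:port>"
FIRST_BAD_LINK = re.compile(r"Bad connect ack with firstBadLink\s+(\S+)")


def count_first_bad_links(lines):
    """Return a Counter mapping datanode address -> number of firstBadLink reports."""
    counts = Counter()
    for line in lines:
        match = FIRST_BAD_LINK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def min_failure_fraction(min_bad_reports, max_tasks):
    """Lower bound on the affected fraction, given 'more than' / 'fewer than' figures."""
    return min_bad_reports / max_tasks


# Hypothetical sample input: three entries copied from the quoted log.
SAMPLE = [
    "08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
    "08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524",
    "08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
]

if __name__ == "__main__":
    # List datanodes by how often they were reported, most frequent first.
    for address, count in count_first_bad_links(SAMPLE).most_common():
        print(address, count)
    # Figures from the report: > 2,500 reports in < 20,000 map tasks.
    print(min_failure_fraction(2500, 20000))
```

With the figures in the report, the bound works out to at least 12.5% of map tasks naming the same datanode. A real run would read lines from the task logs instead of `SAMPLE`.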