Hi, I'm hitting an error/exception in the distributed runtime. The job sometimes works and sometimes fails.
The Web UI says:

Error: java.lang.Exception: The slot in which the task was scheduled has been killed (probably loss of TaskManager).

Looking in the TaskManager logs, I find:

25.Apr. 00:12:42 WARN DFSClient - DFSOutputStream ResponseProcessor exception for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
25.Apr. 00:12:43 WARN DFSClient - DataStreamer Exception
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:278)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:568)
25.Apr. 00:12:42 WARN DFSClient - DFSOutputStream ResponseProcessor exception for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988
java.io.IOException: Bad response ERROR for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 from datanode 172.16.19.81:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)
25.Apr. 00:12:46 WARN DFSClient - Error Recovery for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996 in pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.19.109:50010: bad datanode 172.16.20.112:50010
25.Apr. 00:12:47 WARN DFSClient - Error Recovery for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 in pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.20.105:50010: bad datanode 172.16.20.112:50010
25.Apr. 00:12:48 WARN RemoteWatcher - Detected unreachable: [akka.tcp://flink@172.16.21.111:6123]
25.Apr. 00:12:53 INFO TaskManager - Disconnecting from JobManager: JobManager is no longer reachable
25.Apr. 00:12:53 INFO TaskManager - Cancelling all computations and discarding all cached data.

The JobManager's logs say:

25.Apr. 00:07:37 WARN RemoteWatcher - Detected unreachable: [akka.tcp://flink@172.16.20.112:41265]
25.Apr. 00:07:37 INFO JobManager - Task manager akka.tcp://flink@172.16.20.112:41265/user/taskmanager terminated.
25.Apr. 00:07:37 INFO InstanceManager - Unregistered task manager akka.tcp://flink@172.16.20.112:41265. Number of registered task managers 9. Number of available slots 18.
25.Apr. 00:07:37 INFO JobManager - Status of job 5f021c291483cdf7e7fae3271bfeacb1 (Wikipedia Extraction (dataset = full)) changed to FAILING The slot in which the task was scheduled has been killed (probably loss of TaskManager).

Any idea what I can do? Change some config settings?
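For instance, I was considering raising the Akka death-watch settings in flink-conf.yaml, along these lines. This is only a sketch of what I had in mind; I'm not sure these keys and defaults apply to my Flink version, and the values are guesses:

# Interval at which the death watch heartbeats remote TaskManagers (guessing at the default)
akka.watch.heartbeat.interval: 10 s
# Acceptable heartbeat pause before a remote peer is suspected dead (raised from the default)
akka.watch.heartbeat.pause: 100 s
# Failure detector threshold; a higher value tolerates longer GC or network hiccups
akka.watch.threshold: 12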
I already have:

taskmanager.heartbeat-interval: 10000
jobmanager.max-heartbeat-delay-before-failure.sec: 90

Just in case you think this might be correlated with FLINK-1916, which I reported a while ago: it's a different job, running on different data.

Best,
Stefan