Hi,

I am hitting an error/exception in the distributed runtime. The job sometimes
succeeds and sometimes fails.

The Web UI says:
Error: java.lang.Exception: The slot in which the task was scheduled has been killed (probably loss of TaskManager).

Looking at the TaskManager logs, I find:
25.Apr. 00:12:42 WARN  DFSClient            - DFSOutputStream ResponseProcessor exception  for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
25.Apr. 00:12:43 WARN  DFSClient            - DataStreamer Exception
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:278)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:568)
25.Apr. 00:12:42 WARN  DFSClient            - DFSOutputStream ResponseProcessor exception  for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988
java.io.IOException: Bad response ERROR for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 from datanode 172.16.19.81:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)
25.Apr. 00:12:46 WARN  DFSClient            - Error Recovery for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996 in pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.19.109:50010: bad datanode 172.16.20.112:50010
25.Apr. 00:12:47 WARN  DFSClient            - Error Recovery for block BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 in pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.20.105:50010: bad datanode 172.16.20.112:50010
25.Apr. 00:12:48 WARN  RemoteWatcher        - Detected unreachable: [akka.tcp://flink@172.16.21.111:6123]
25.Apr. 00:12:53 INFO  TaskManager          - Disconnecting from JobManager: JobManager is no longer reachable
25.Apr. 00:12:53 INFO  TaskManager          - Cancelling all computations and discarding all cached data.

The JobManager's logs say:
25.Apr. 00:07:37 WARN  RemoteWatcher        - Detected unreachable: [akka.tcp://flink@172.16.20.112:41265]
25.Apr. 00:07:37 INFO  JobManager           - Task manager akka.tcp://flink@172.16.20.112:41265/user/taskmanager terminated.
25.Apr. 00:07:37 INFO  InstanceManager      - Unregistered task manager akka.tcp://flink@172.16.20.112:41265. Number of registered task managers 9. Number of available slots 18.
25.Apr. 00:07:37 INFO  JobManager           - Status of job 5f021c291483cdf7e7fae3271bfeacb1 (Wikipedia Extraction (dataset = full)) changed to FAILING The slot in which the task was scheduled has been killed (probably loss of TaskManager)..

Any idea what I can do? Should I change some config settings?

I already have:
taskmanager.heartbeat-interval: 10000
jobmanager.max-heartbeat-delay-before-failure.sec: 90
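
Would tuning the Akka deathwatch/timeout settings help? I am not sure these are the right keys for my version (please correct me if the names are off), but I was thinking of something along these lines in flink-conf.yaml:

# existing settings (see above)
taskmanager.heartbeat-interval: 10000
jobmanager.max-heartbeat-delay-before-failure.sec: 90

# candidates I would try next -- assuming these key names apply here
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 120 s
akka.ask.timeout: 100 s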

Just in case you think this might be correlated with FLINK-1916, which I
reported a while ago: it's a different job, running on different data.


Best,
Stefan
