Hi,
Has anyone ever experienced the Kafka producer getting stuck in cancelling?
I am aware that there were problems with the Kafka consumer before, but I
haven't seen this one yet. It happened to three of my jobs simultaneously
last night. They were stuck from about 8 pm to 8 am (not exact times, but
you get an idea of the duration).
The JobManager logs don't seem to be very helpful. They just show that all
tasks start cancelling and then go to cancelled, except for one Kafka sink
task. That task goes into cancelling but only gets cancelled 12 hours
later. On one of the task managers I did find this, though:

2016-11-21 20:22:52,220 INFO  org.apache.flink.yarn.YarnTaskManager - Un-registering task and sending final execution state CANCELED to JobManager for task Execute EventProcessors (f030e71787a6dbd7a543e9745c42289d)
2016-11-22 08:49:35,181 WARN  org.apache.kafka.common.network.Selector - Error in I/O with kafka17.sto.midasplayer.com/172.25.82.212
java.io.EOFException
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:248)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:135)
    at java.lang.Thread.run(Thread.java:745)
2016-11-22 08:49:35,183 INFO  org.apache.flink.runtime.taskmanager.Task - Sink: Kafka output (2/8) switched to CANCELED

There might have been some network or Kafka issue that caused the three
jobs to get stuck at the same time, but I don't know what actually happened.
Any ideas?
Gyula