Help debugging Kafka connection leaks after job failure/cancelation

Fritz Budiyanto Tue, 26 Mar 2019 16:35:39 -0700

Hi All,

We're using Flink-1.4.2 and noticed many dangling connections to Kafka after 
job deletion/recreation. The trigger here is Job cancelation/failure due to 
network down event followed by Job recreation.


Our flink job has checkpointing disabled, and upon job failure (due to network 
failure), the Job got deleted and re-created. There were network failure event 
which impacting communication between task manager(s) and task-manager <-> 
job-manager. Our custom job controller monitored this condition and tried to 
cancel the job, followed by recreating the job (after a minute or so).

Because of the network failure, the above steps were repeated many times and 
eventually the flink-docker-container's socket file descriptors were exhausted.
Looks like there were many Kafka connections from flink-task-manager to the 
local Kafka broker.

netstat  -ntap | grep 9092 | grep java | wc -l
2235

Is this a known issue which already fixed in later release ? If yes, could 
someone point out the Jira link?
If this is a new issue, could someone let me know how to move forward and debug 
this issue ? Looks like kafka consumers were not cleaned up properly upon job 
cancelation.

Thanks,
Fritz

Help debugging Kafka connection leaks after job failure/cancelation

Reply via email to