Hi guys,

Our team is observing a stability issue on our standalone Flink clusters.

Background: the Kafka cluster our Flink jobs read from/write to has some
issues, and every 10 to 15 minutes one of the partition leaders switches.
This causes the jobs that read from/write to that topic to fail and
restart. Usually this is not a problem, since the jobs can restart and work
with the new partition leader. However, one of those restarts can put the
jobs into a failing state from which they never recover.
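For context, the restarts go through Flink's restart strategy. Ours is a
fixed-delay configuration in flink-conf.yaml along these lines (the values
below are placeholders for illustration, not our exact production settings):

# Illustrative values only, not our production settings.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s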

In the failing state, the JobManager reports this exception:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 24, slots allocated: 12
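If I understand the 1.6 configuration correctly, the 300000 ms is the slot
request timeout, which can be raised in flink-conf.yaml:

# slot.request.timeout is in milliseconds; 300000 (5 min) is the default.
# Raising it (e.g. to 10 min) presumably only delays the failure if the
# slots are truly stuck, which seems to be our case.
slot.request.timeout: 600000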

During that time, 2 of the TaskManagers report that all of their slots are
occupied; however, the JobManager dashboard shows no job deployed to those 2
TaskManagers.
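The same numbers can be checked through the REST API, which lists
slotsNumber and freeSlots per TaskManager (host/port below are placeholders
for our JobManager); for those 2 TaskManagers, freeSlots stays at 0:

# Placeholder host/port; returns one entry per TaskManager
# with slotsNumber and freeSlots fields.
curl http://jobmanager-host:8081/taskmanagers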

My guess is that, since the jobs restart fairly frequently, at some point
the slots are not released properly when a job fails, leaving the JobManager
falsely believing that those 2 TaskManagers' slots are still occupied.
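In case it's relevant, these are the timeout-related options that, as far as
I can tell from the 1.6 docs, influence how quickly such stale slots should
be reconciled; we are running with the defaults:

# 1.6 defaults, in milliseconds. That these govern stale-slot
# reconciliation in our case is my assumption, not something I've confirmed.
heartbeat.interval: 10000
heartbeat.timeout: 50000
slot.idle.timeout: 50000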

It does sound like the issue described in
https://issues.apache.org/jira/browse/FLINK-9932
but we are using 1.6.2, and according to the JIRA ticket that bug was fixed
in 1.6.2.

Please let me know if you have any ideas about what is going on, or how we
can prevent it. Thank you so much!



