Hi,

It would be helpful for understanding the problem if you could share the logs, in particular from the JobManager and the two affected TaskManagers.
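In the meantime, two config options may be worth double-checking, since they control how long the JobManager waits for slots and how long unused slots are held before being returned. A minimal flink-conf.yaml sketch (the values shown are the 1.6 defaults as far as I know, not recommendations):

    # flink-conf.yaml
    # Timeout for a slot request from the slot pool; this is the
    # "300000 ms" that appears in the NoResourceAvailableException below.
    slot.request.timeout: 300000

    # How long an unused slot is kept in the slot pool before it is
    # returned to the ResourceManager.
    slot.idle.timeout: 50000

If the slots really are leaked, as you suspect, restarting the two affected TaskManager processes should release them and let the job recover; that is only a workaround, of course, not a fix.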
Thank you~

Xintong Song

On Wed, Jan 15, 2020 at 12:23 AM burgesschen <tchen...@bloomberg.net> wrote:

> Hi guys,
>
> Our team is observing a stability issue on our standalone Flink clusters.
>
> Background: The Kafka cluster our Flink jobs read from and write to has
> some issues, and every 10 to 15 minutes one of the partition leaders
> switches. This causes jobs that read from or write to that topic to fail
> and restart. Usually this is not a problem, since the jobs can restart and
> work with the new partition leader. However, one of those restarts can put
> the jobs into a failing state from which they never recover.
>
> In the failing state, the JobManager logs this exception:
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 24, slots allocated: 12
>
> During that time, 2 of the TaskManagers report that all of their slots are
> occupied; however, according to the JobManager dashboard, no job is
> deployed to those 2 TaskManagers.
>
> My guess is that, since the jobs restart fairly frequently, on one of
> those restarts the slots were not released properly when the jobs failed,
> leaving the JobManager falsely believing that those 2 TaskManagers' slots
> are still occupied.
>
> It sounds like the issue described in
> https://issues.apache.org/jira/browse/FLINK-9932
> but we are using 1.6.2, and according to the JIRA ticket this bug is
> fixed in 1.6.2.
>
> Please let me know if you have any ideas or how we can prevent it. Thank
> you so much!
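Regarding the frequent restarts: configuring a delay between restart attempts gives the slots from a failed attempt more time to be released before the next allocation round, which may make this situation less likely to trigger. A minimal job-side sketch using the DataStream API (the attempt count and delay are hypothetical placeholders; the same strategy can also be set cluster-wide in flink-conf.yaml):

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartDelayExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Restart at most 10 times, waiting 30 seconds between attempts,
            // so slots from the failed attempt can be freed before the job
            // asks the ResourceManager for new ones.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    10,                 // max attempts (placeholder value)
                    Time.seconds(30))); // delay between attempts (placeholder)

            // Stand-in pipeline; the real job would read from / write to Kafka.
            env.fromElements(1, 2, 3).print();

            env.execute("restart-delay-example");
        }
    }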