[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803209#comment-16803209 ]
Shuyi Chen commented on FLINK-11914:
------------------------------------

Hi [~Zentol], thanks a lot for the comments. cc [~till.rohrmann], since we had some offline discussion as well.

Currently, the YARN resource scheduler does not take dynamic resource usage into account. Over time, the resource usage of some containers might increase, or some containers might use more than they asked for, and thus oversubscribe the host's resources. The resource causing the lag might be CPU/memory/FD/disk/network, or even some application-specific cause. This commonly happens in a shared cluster, and it's not possible for the resource scheduler to predict and regulate runtime resource usage effectively. As in other frameworks such as MapReduce or Spark, if there is a straggling task, it should be the responsibility of the framework, not the resource scheduler, to restart that task on a different node, since the resource scheduler has no idea what it means for one container to be slow.

I think exposing an endpoint to disconnect a TM will enable us to build an external monitor/controller that recovers the Flink job by relocating the straggling TM. The external controller will synthesize information from Flink metrics, application metrics and host metrics to determine whether a TM is straggling, and relocate it if so. This will greatly help scale our platform to manage more Flink jobs.

Also, you are correct that the same slow host might get allocated again after the kill. To mitigate this, I propose adding a reason parameter to the API and letting the Flink resource scheduler blacklist that host when acquiring resources from YARN/Mesos.

With regard to adding a UI button for this, I understand your concern, and we can discuss the need in a follow-up.
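As a rough illustration of the straggler check such an external controller might perform, here is a minimal Python sketch. It is not part of Flink or of this proposal: the function name `find_stragglers`, the `factor` and `min_lag` parameters, and the idea of comparing each TM's lag against the group median are all assumptions made for the example. How the per-TM lag values are collected (Flink metrics, application metrics, host metrics) is left out.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(
    lag_by_tm: Dict[str, float],
    factor: float = 3.0,
    min_lag: float = 1.0,
) -> List[str]:
    """Return the IDs of TaskManagers whose lag is far above the group median.

    lag_by_tm: hypothetical mapping of TaskManager ID -> observed lag
               (any consistent unit, e.g. seconds behind).
    factor:    how many times the median a TM's lag must exceed to be flagged.
    min_lag:   absolute floor, so that tiny lags are never flagged.
    """
    # With fewer than two TMs there is no peer group to compare against.
    if len(lag_by_tm) < 2:
        return []

    baseline = median(lag_by_tm.values())
    threshold = max(baseline * factor, min_lag)

    # Sorted only to make the result deterministic.
    return sorted(tm for tm, lag in lag_by_tm.items() if lag > threshold)


if __name__ == "__main__":
    observed = {"tm-1": 10.0, "tm-2": 12.0, "tm-3": 95.0}
    print(find_stragglers(observed))
```

A real controller would presumably require the condition to hold over several consecutive observations before relocating a TM, to avoid reacting to short spikes.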
> Expose a REST endpoint in JobManager to kill specific TaskManager
> -----------------------------------------------------------------
>
>                 Key: FLINK-11914
>                 URL: https://issues.apache.org/jira/browse/FLINK-11914
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / REST
>            Reporter: Shuyi Chen
>            Assignee: Shuyi Chen
>            Priority: Major
>
> We want to add the capability in the Flink web UI to kill an individual TM by
> clicking a button. This would first require exposing the capability through a
> REST API endpoint. The reason is that some TMs might end up running on a heavily
> loaded YARN host over time, and we want to kill just that TM and have the Flink
> JM reallocate a TM to restart the job graph. The other approach would be to
> restart the entire YARN job, which is heavy-weight.