[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803209#comment-16803209
 ] 

Shuyi Chen commented on FLINK-11914:
------------------------------------

Hi [~Zentol], thanks a lot for the comments. cc [~till.rohrmann], since we had 
some offline discussion as well.

Currently, the YARN resource scheduler does not take into dynamic resource 
usage. Also over time, the resource usage of some containers might increase or 
some containers might use more than what they ask for, thus, oversubscribe host 
resource. Also, the resource that causing lags might be 
CPU/memory/FD/Disk/network, or even some application specific cause. This 
commonly happen in a shared cluster, and it’s not possible for the resource 
scheduler to predict and regulate the runtime resource usage effectively. Like 
other frameworks, like MapReduce or Spark, if there is a straggle task, it 
should be the responsibility of the framework to restart the straggle task in a 
different node, but not the resource scheduler, since the resource schedule has 
no idea what it means for one container to be slow.

I think exposing an endpoint to disconnect TM will enable us to build external 
monitor/controller to recover the flink job by relocating the straggling TM. 
The external controller will synthesize information from Flink metrics, 
application metrics and host metrics to determine whether a TM is straggling 
and relocate it. This will greatly help scale our platform to manage more Flink 
jobs.

Also, you are correct that it's possible that the same slow host get allocated 
again after the kill. To mitigate the issue, I propose we can add a reason 
parameter for the API and let the Flink resource scheduler to blacklist that 
host from the resource acquisition from YARN/Mesos. 

With regards to adding a UI button for this, I understand your concern and we 
can discuss the need in follow-up. 

> Expose a REST endpoint in JobManager to kill specific TaskManager
> -----------------------------------------------------------------
>
>                 Key: FLINK-11914
>                 URL: https://issues.apache.org/jira/browse/FLINK-11914
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / REST
>            Reporter: Shuyi Chen
>            Assignee: Shuyi Chen
>            Priority: Major
>
> we want to add capability in the Flink web UI to kill each individual TM by 
> clicking a button, this would require first exposing the capability from the 
> REST API endpoint. The reason is that  some TM might be running on a heavily 
> loaded YARN host over time, and we want to kill just that TM and have flink 
> JM to reallocate a TM to restart the job graph. The other approach would be 
> restart the entire YARN job and this is heavy-weight.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to